Visualising Microservice Timings in a Waterfall Chart

May 6, 2020

A picture speaks more than words.

Image Credit: https://www.flickr.com/photos/yasmine-hens/28322602168/

Setting the Context

In this article we will talk about visualising performance timings in a microservices environment, which is not an easy task, especially when the services are highly reactive and asynchronous.

Let’s take a typical microservices architecture as an example to set the context, where the services run inside a containerised environment.

Let’s see what the challenges are in monitoring the performance of each microservice.

As you can see below, our performance testing client sits outside the cluster to mimic an actual user, so our end-to-end response timings cover the complete round trip.

As seen in the architecture, the purpose of the Workflow Orchestrator service is to take the request from the API Gateway and fire it to all the microservices asynchronously at the same time. Once the responses come back from all the services, the orchestrator collates them and sends the result back to the gateway.
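The fan-out and collate behaviour described above can be sketched in a few lines of JavaScript. This is only an illustration, not the actual orchestrator: `callService` is a hypothetical stand-in that simulates a downstream call with a timer, and the service names and delays are made up.

```javascript
// Hypothetical stand-in for a downstream call: resolves after `ms` milliseconds.
function callService(name, ms) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ service: name, tookMs: ms }), ms);
  });
}

// Fire every request at the same moment, wait for all of them, then collate.
async function orchestrate(services) {
  const startedAt = Date.now();
  const responses = await Promise.all(
    services.map(({ name, ms }) => callService(name, ms))
  );
  return { responses, totalMs: Date.now() - startedAt };
}

// Example usage with made-up delays.
orchestrate([
  { name: "A", ms: 20 },
  { name: "B", ms: 60 },
]).then((result) => console.log(result));
```

When the fan-out really is parallel, `totalMs` should land close to the slowest call rather than the sum of all calls.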

Performance Testing Requirements:

To get the complete round-trip timings for the end-to-end transaction from the client's point of view.
Services A to F do all the heavy lifting to complete the transaction, so it is essential to monitor the time taken by each of these microservices.
To compare the timing of each microservice and see what percentage of the total transaction time each one contributes.
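The percentage comparison in the last requirement is simple arithmetic. A minimal sketch, assuming the timings are already available as a plain object of seconds per service (the function name `shareOfTotal` and the sample numbers are hypothetical):

```javascript
// Returns each service's time as a percentage of the total transaction time,
// rounded to one decimal place.
function shareOfTotal(serviceSeconds, totalSeconds) {
  return Object.entries(serviceSeconds).map(([service, seconds]) => ({
    service,
    seconds,
    percentOfTotal: Math.round((seconds / totalSeconds) * 1000) / 10,
  }));
}

// Example usage with made-up numbers, not measured values.
console.log(shareOfTotal({ A: 2, B: 5 }, 10));
```

Because the services run in parallel, these shares can add up to more than 100% of the total transaction time.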

Performance Metrics Collection:

To collect the round-trip timing of the transaction, we can rely on the performance testing tool, which provides this number very easily.

Getting the response timing of each microservice is a more challenging task, as it requires the assistance of APM tools. There are many APM tools, such as New Relic and AppDynamics, which can be used to get these timings.

If we are not fortunate enough to have these tools in place, we can also use a logging system like ELK to get these timings by firing a few Elasticsearch queries.
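As a rough idea of what such a query could look like, here is a sketch of an Elasticsearch search body that pulls the log entries for a single transaction. The field names `transactionId`, `service` and `durationMs` are assumptions about how the logs are indexed; your own mapping will almost certainly differ.

```javascript
// Builds a search body for all log entries of one transaction, oldest first.
// Field names (transactionId, service, durationMs) are hypothetical.
function buildTimingQuery(transactionId) {
  return {
    size: 100,
    query: {
      bool: {
        filter: [{ term: { transactionId } }],
      },
    },
    sort: [{ "@timestamp": { order: "asc" } }],
    _source: ["service", "durationMs", "@timestamp"],
  };
}

// Example usage: print the body that would be sent to the _search endpoint.
console.log(JSON.stringify(buildTimingQuery("txn-123"), null, 2));
```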

Visualization of the timings in a Simple Bar Graph:

Let’s assume we are able to get all the microservice timings for an outlier that took 10.5 seconds in total, and try to plot them in a simple bar graph.

As you can see in the graph below, the maximum time is taken by service F, at 5.6 seconds, yet the total transaction time is 10.5 seconds. That is quite odd: since all the services are fired in parallel at the same moment, the total should be close to the time of the slowest service.

In my case there were many outliers like this, especially when we increased the load to more than 100 concurrent users. Interestingly, for a smaller number of users, such as 10, the total transaction time was comparable to the individual service timings, as shown in this graph.
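The mismatch can be put into numbers. If the calls truly run in parallel, the total should be roughly the slowest service's time, so anything beyond that is unexplained. A minimal sketch (the function name `unexplainedGap` is hypothetical; only the 5.6 and 10.5 second figures come from this outlier, the value for service A is a placeholder):

```javascript
// Time in the transaction that the slowest parallel call cannot account for.
function unexplainedGap(serviceSeconds, totalSeconds) {
  const slowest = Math.max(...Object.values(serviceSeconds));
  return totalSeconds - slowest;
}

// F and the total are the figures from the outlier; A is a made-up placeholder.
const gap = unexplainedGap({ A: 1.0, F: 5.6 }, 10.5);
console.log(gap.toFixed(1)); // seconds not explained by the slowest service
```

A gap of several seconds, as here, suggests that the calls are not all starting at the same time, or that time is being spent outside the services themselves.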

How to proceed further and delve deeper into this issue?

Now we need to validate that the services really are fired at the same moment. Until now we had taken this as an assumption, because the service developer had described the implementation that way.

To do that, we need to capture the start time and end time of each microservice transaction by mining the application logs with a few Elasticsearch queries.
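Once the start and end timestamps are available, it helps to convert them into offsets relative to the earliest start, so that every service sits on the same timeline. A minimal sketch, assuming each log-derived record has hypothetical `service`, `start` and `end` fields with ISO 8601 timestamps:

```javascript
// Converts absolute timestamps into millisecond offsets from the earliest start.
function toRelativeOffsets(spans) {
  const origin = Math.min(...spans.map((s) => Date.parse(s.start)));
  return spans.map((s) => ({
    service: s.service,
    startOffsetMs: Date.parse(s.start) - origin,
    endOffsetMs: Date.parse(s.end) - origin,
  }));
}

// Example usage with made-up timestamps.
console.log(
  toRelativeOffsets([
    { service: "A", start: "2020-05-06T10:00:00.000Z", end: "2020-05-06T10:00:02.000Z" },
    { service: "B", start: "2020-05-06T10:00:00.500Z", end: "2020-05-06T10:00:03.000Z" },
  ])
);
```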

Visualization of the timings in a Waterfall Graph:

Now the biggest challenge was how to visualise this data, since we have to show the relative timing of each microservice transaction over the timeline of the overall end-to-end transaction.
I went through a lot of graph types that could show this kind of data, and then I realised that Chrome DevTools has a similar view, the waterfall graph, which shows all the requests being made in parallel and where they stand against each other on the larger timeline of the transaction.

Here is a sample waterfall view from Chrome DevTools.

We need to replicate a similar view for our scenario, so I scanned different visualisation tools such as Kibana, Grafana and Excel, but unfortunately none of them provided this kind of waterfall chart.

Then I had to call on the JavaScript developer inside me, and I decided to build this graph from scratch using a UI library. Fortunately I found Highcharts, a library that offers a lot of amazing graphs, including the waterfall chart.
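To give a feel for what this involves, here is a minimal sketch of a chart configuration for a timing waterfall. It is an illustration only and not the code used for this article: it assumes the `columnrange` series type from the `highcharts-more` module, drawn as horizontal bars, and the function name `buildWaterfallOptions`, the field names `startSec` and `endSec`, and the sample values are all hypothetical.

```javascript
// Builds Highcharts options where each service is a horizontal bar
// running from its start offset to its end offset.
function buildWaterfallOptions(spans) {
  return {
    chart: { type: "columnrange", inverted: true },
    title: { text: "Service timings within one transaction" },
    xAxis: { categories: spans.map((s) => s.service) },
    yAxis: { min: 0, title: { text: "Seconds since transaction start" } },
    legend: { enabled: false },
    series: [
      {
        name: "Service call",
        // Each point is [low, high]; x positions follow the category order.
        data: spans.map((s) => [s.startSec, s.endSec]),
      },
    ],
  };
}

// Made-up spans for illustration.
const options = buildWaterfallOptions([
  { service: "A", startSec: 0, endSec: 2.1 },
  { service: "B", startSec: 0.1, endSec: 3.4 },
]);
console.log(JSON.stringify(options.series[0].data));

// In a page that loads highcharts.js and highcharts-more.js:
// Highcharts.chart("container", options);
```

Keeping the options in a plain function makes it easy to regenerate the chart for any outlier once its spans have been extracted from the logs.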

Here is the final waterfall chart for one of the outliers.

Waterfall Chart created using Highcharts library

We can clearly see that services E and F were fired only after the first four services had completed, which is why the total transaction time increased.
This could have happened because the Orchestrator service was getting overwhelmed at higher concurrency and was therefore not firing all the service requests at the same time.
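The same check can be automated when there are many outliers to scan. A minimal sketch, assuming each span already carries a hypothetical `startOffsetMs` field relative to the transaction start, and using an arbitrary tolerance of 100 ms:

```javascript
// Lists services whose start lags the earliest start by more than the tolerance.
function findLateStarters(spans, toleranceMs = 100) {
  const firstStart = Math.min(...spans.map((s) => s.startOffsetMs));
  return spans
    .filter((s) => s.startOffsetMs - firstStart > toleranceMs)
    .map((s) => ({ service: s.service, delayMs: s.startOffsetMs - firstStart }));
}

// Example usage with made-up offsets.
console.log(
  findLateStarters([
    { service: "A", startOffsetMs: 0 },
    { service: "B", startOffsetMs: 40 },
    { service: "E", startOffsetMs: 4800 },
  ])
);
```

Any service in the result was not fired together with the others, which is the pattern the waterfall chart made visible.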

We could not have discovered this bottleneck without visualising the results in a waterfall chart. That's why they say:

“A picture speaks more than words.”

Here is the complete JavaScript code which was used to generate the above waterfall graph.

Conclusion:

As a Performance Testing Specialist, my job does not end at executing the tests and providing a metrics table with all the required percentile numbers. In the case of outliers, it is also my responsibility to provide the tooling that gives more insight into those outliers, so that the team can make a more informed decision.

Thanks for reading this blog.