Microservices Challenges — Distributed Troubleshooting and Monitoring


Microservices architecture brings us many great benefits, as we discussed here. However, nothing comes free, and neither do the benefits of microservices. Various challenges arrive as by-products of microservices, and it makes sense to understand them well before beginning the journey into this fancy new world.

One of these challenges is the complexity of troubleshooting that comes with the microservices pattern. We shall discuss this challenge and its possible solution in this article.

Challenge

Troubleshooting is part of development life. Life is comparatively easy when debugging a single monolithic system. But the microservices world has many smaller services, which interact with each other in a mix of combinations to support bigger application features. With such a dynamic interaction matrix, troubleshooting any issue can span a multitude of services. Finding the trail of a transaction across multiple services and multiple servers, a trail that can itself vary with the business logic of each service, can be killing. It makes troubleshooting overly complex, demanding deep system knowledge and the persistence to walk through many, many nodes.

But is it really doable manually? Especially when this complexity is multiplied by auto-scalable, container-based deployment models, where server nodes change dynamically based on need.

Definitely not; it is not manageable without proper tooling in place. And even with tooling, a complex service request trail takes effort to understand.

Solution

Distributed Tracing, Log Aggregation, and Visualization are the saviors.

Distributed tracing: This means being able to find the related logs across services and nodes. It is done by adding an identifier to the logs that helps trace (correlate) a request across services and server nodes. Think of it as attaching one constant at the first service hit, which is then passed along to every service call made from there. Every service, if it is also enabled with distributed tracing, mostly automatically picks up this constant trace id and uses it while logging its own data. As a result, logs across services now share something common that correlates them to one request.
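
Below is a minimal sketch of this propagation in Java, assuming a Servlet 4.0+ container (where Filter has default init/destroy methods) and SLF4J's MDC for the per-request log context. The X-Trace-Id header name is a convention picked for illustration, not a standard.

```java
import java.io.IOException;
import java.util.UUID;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import org.slf4j.MDC;

public class TraceIdFilter implements Filter {

    private static final String TRACE_HEADER = "X-Trace-Id"; // hypothetical header name

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;

        // Reuse the caller's trace id if present; otherwise this is the
        // first hop, so mint a new one.
        String traceId = httpRequest.getHeader(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString();
        }

        // Put the id into the logging context so every log line written
        // while handling this request carries it automatically.
        MDC.put("traceId", traceId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("traceId"); // avoid leaking into the next request on this thread
        }
    }
}
```

An outgoing HTTP client would then copy the traceId value from the MDC into the same header on downstream requests, so the next service's filter picks it up and the chain continues.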

Log Aggregator: Now we have an identifier to recognize one request across services. The next challenge is that it is not practical to log in to different servers and analyze the logs there. Hence, the next step is to collect and aggregate all the logs in some common store so trails can be analyzed across services. Log aggregation tools help here: every service that uses the tracing id and prepares the logging data also sends this data to one centralized log aggregator service through some connector.
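
The sketch below shows the connector idea in its simplest form, using Java 11's built-in HTTP client and assuming a collector that accepts one JSON log event per HTTP POST. The endpoint, port, and field names are illustrative assumptions, not any real tool's API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

public class LogConnector {

    private final HttpClient client = HttpClient.newHttpClient();
    private final URI collector = URI.create("http://log-aggregator:9200/logs"); // hypothetical endpoint

    public void ship(String serviceName, String traceId, String message) throws Exception {
        // One structured event per log line. The traceId field is what lets
        // the aggregator stitch entries from different services together.
        // (Naive JSON formatting for brevity; real code would escape the message.)
        String event = String.format(
                "{\"service\":\"%s\",\"traceId\":\"%s\",\"timestamp\":\"%s\",\"message\":\"%s\"}",
                serviceName, traceId, Instant.now(), message);

        HttpRequest request = HttpRequest.newBuilder(collector)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(event))
                .build();
        client.send(request, HttpResponse.BodyHandlers.discarding());
    }
}
```

Real setups usually delegate this shipping to a log forwarding agent, and they batch and send asynchronously, as discussed under Performance Impact below.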

Log Visualizer: The log aggregator stores these logs for later analysis in some search-friendly, scalable database that can hold bulk log entries and search them fast. This data can then be used to present the whole request trail in an easily understandable format in a UI.

Once we have this data, it opens up many opportunities, e.g.:

  • See request trails across the services for troubleshooting.
  • Analyze performance data, i.e. the time taken by each service/operation (see the sketch after this list).
  • Study client usage patterns or user behavior, i.e. which service is used more, and at what time.
  • Potentially connect this data with context information for even deeper analysis, e.g. which service was used more near the new year, and so on.
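
As a flavor of the performance analysis mentioned above, here is a toy Java 16+ sketch that averages the recorded duration per service. The LogEntry shape and the sample data are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LatencyReport {

    // Hypothetical shape of an aggregated log entry with a recorded duration.
    record LogEntry(String service, String traceId, long durationMillis) {}

    public static void main(String[] args) {
        List<LogEntry> entries = List.of(
                new LogEntry("order-service", "t1", 120),
                new LogEntry("payment-service", "t1", 340),
                new LogEntry("order-service", "t2", 95));

        // Group by service and average the recorded durations.
        Map<String, Double> avgByService = entries.stream()
                .collect(Collectors.groupingBy(
                        LogEntry::service,
                        Collectors.averagingLong(LogEntry::durationMillis)));

        avgByService.forEach((service, avg) ->
                System.out.printf("%s -> %.1f ms%n", service, avg));
    }
}
```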

Performance Impact

An obvious question: doesn't logging data for each service call and operation, and then sending it to a centralized collector, add performance overhead? That is true to some extent; however, there are ways to optimize it.

Adding the distributed trace token and logging is usually managed by instrumentation-driven libraries, which add pre/post hooks around each method to log the basic information. What gets logged is configurable. And as this logging is managed by libraries and mostly abstracted away, these libraries are highly optimized for the job.
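
To make the mechanics concrete, here is a hand-rolled Java sketch of such pre/post hooks. Real instrumentation libraries inject the equivalent via bytecode weaving or dynamic proxies rather than explicit wrapping, and the usage line at the end references a hypothetical repository.

```java
import java.util.function.Supplier;

import org.slf4j.MDC;

public final class Traced {

    // Wraps a call with pre/post hooks that record timing and the active trace id.
    public static <T> T call(String operation, Supplier<T> body) {
        String traceId = MDC.get("traceId"); // set by the filter shown earlier
        long start = System.nanoTime();
        System.out.printf("[%s] start %s%n", traceId, operation);          // pre hook
        try {
            return body.get();
        } finally {
            long tookMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("[%s] end %s took=%dms%n", traceId, operation, tookMs); // post hook
        }
    }
}

// Usage (orderRepository is hypothetical):
// String result = Traced.call("loadOrder", () -> orderRepository.load(id));
```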

Sending logging data to centralized servers is optimized by making it asynchronous. For example, a log reporting client on each node can keep a local in-memory (and on-disk) buffer that collates the logs; the collected logs are later sent to the centralized server asynchronously using HTTP requests, queues, etc. This reduces the reporting overhead a lot.
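
A minimal sketch of this asynchronous pattern in Java: log calls only enqueue into a bounded in-memory buffer, and a background daemon thread drains it in batches. The batch size and capacity are arbitrary illustration values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncLogReporter {

    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000);

    public AsyncLogReporter() {
        Thread flusher = new Thread(this::flushLoop, "log-flusher");
        flusher.setDaemon(true);
        flusher.start();
    }

    // Called on the request path: O(1), never blocks on the network.
    public void report(String event) {
        // If the buffer is full, drop the event rather than slow the caller.
        buffer.offer(event);
    }

    private void flushLoop() {
        List<String> batch = new ArrayList<>();
        while (true) {
            try {
                batch.add(buffer.take());   // wait for at least one event
                buffer.drainTo(batch, 99);  // then grab up to a batch of 100
                sendBatch(batch);           // one HTTP call / queue publish per batch
                batch.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void sendBatch(List<String> batch) {
        // Placeholder: POST the batch to the collector, e.g. via the LogConnector sketch.
        System.out.println("shipping " + batch.size() + " events");
    }
}
```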

The cost of these benefits cannot be zero, but it can be well contained in most cases. And as tracing libraries are highly configurable in their log levels and reporting mechanisms, they provide various options to tune the process to each application's need.

Tech to Support

There are many library and tool options to enable all of this.

Zipkin & Jaeger are the two most popular distributed tracing systems available as of now. Zipkin was developed by Twitter, and Jaeger by Uber.

  • Both were open sourced by their respective companies once they reached production readiness, and each has its own supporting open source community.
  • Both support the basic components of distributed tracing / log aggregation or collection / visualization, as described above. Additionally, each supports many more useful features on top, specific to the tool.
  • Both are quite similar in architecture and usage. They differ mostly in deployment style and in the plugins available for different languages and frameworks.

Refer to their respective websites for detailed information; both are well-documented tools.

Also look at OpenTracing, which is on a mission to standardize tracing APIs and tools. It is an incubating project in the Cloud Native Computing Foundation.
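
For flavor, here is a short sketch against the OpenTracing 0.33-style Java API (io.opentracing), which both Zipkin and Jaeger clients can back. The operation and tag names are illustrative, and wiring a concrete tracer into GlobalTracer is tool-specific setup not shown here.

```java
import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class CheckoutHandler {

    public void handleCheckout(String orderId) {
        Tracer tracer = GlobalTracer.get(); // a no-op tracer until a real one is registered

        // One span per logical operation; spans nest to form the request trail.
        Span span = tracer.buildSpan("checkout").start();
        span.setTag("order.id", orderId); // hypothetical tag for illustration
        try (Scope scope = tracer.activateSpan(span)) {
            // ... business logic; downstream calls pick up the active span ...
        } finally {
            span.finish(); // records the duration and reports the span
        }
    }
}
```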

Conclusion

The combination of these tools and design patterns makes distributed troubleshooting and monitoring possible in a microservice ecosystem. Keep in mind, it is still more complex than monolith models. It still needs deep system knowledge across the services and the ability to connect all the dots. It demands experience with the system and breadth of understanding. However, the combination of these tools makes it feasible.

Hence, if you are not ready for an iota of extra complexity, it is recommended to stay away from microservices-style distributed architecture patterns, where the number of services can be high due to their fine-grained structure. Otherwise, start with the right tooling in place to manage it well from the beginning. Don't ever think of running a highly distributed microservice-style system without these supporting tools, or you will spend months of effort on troubleshooting, with a high potential to burn out teams both physically and morally, and lose on productivity and cost.

We shall discuss more challenges, like distributed transactions and managing failures in distributed environments, in future articles.

Till then, happy learning and stay tuned!

