Microservices Challenges — Distributed Troubleshooting and Monitoring

Mohit Gupta
Feb 27, 2021

Microservices architecture brings many great benefits, as we discussed here. However, nothing comes for free, and neither do the benefits of microservices. Several challenges come as by-products of microservices, and it makes sense to understand them well before beginning the journey into this new, fancy world.

One of these challenges is the complexity of troubleshooting that comes with the microservices pattern. We shall discuss this challenge, along with possible solutions, in this article.

Challenge

Troubleshooting is part of development life. Life is comparatively easy when debugging one monolithic system. But the microservices world has many smaller services, which interact with each other in many combinations to support larger application features.

With such a dynamic interaction matrix, troubleshooting any issue could span a multitude of services. Finding the trail of a request across multiple services and servers, which could itself vary based on the business logic of each service, can be exhausting.

This makes troubleshooting overly complex; it demands deep system knowledge and the persistence to walk through many, many nodes.

With deep knowledge of the system and its distributed architecture, debugging is possible, but it still consumes a lot of time and energy. Now imagine this complexity multiplied by auto-scalable, container-based deployment models, where server nodes change dynamically based on demand.

It is not manageable without proper tooling in place. Even then, it is complex to understand thousands of service request trails.

Solution

Distributed Tracing, Log Aggregation, and Visualization are the saviors.

Distributed tracing

Distributed tracing enables us to trace related logs across services and nodes.

It is implemented by adding an identifier to the logs, which helps trace (by correlation) a request across services and server nodes.

Think of it as attaching a constant to the request when it first hits the system. This constant is then passed along to all subsequent service calls. Every service with distributed tracing enabled automatically picks up this trace id and uses it while logging its data.

In turn, this makes it easy to correlate the data of the same request across services and servers.
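As a minimal sketch of this idea (the header name, service names, and plain-Java plumbing are assumptions for illustration, not any particular tracing library's API): the first service generates a trace id if none arrived with the request, logs with it, and forwards it on every downstream call.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class TraceIdPropagation {

    // Hypothetical header name; real tracers use headers such as B3 or W3C traceparent.
    static final String TRACE_HEADER = "X-Trace-Id";

    // Called when a request enters this service.
    static String resolveTraceId(String incomingHeaderValue) {
        // Reuse the caller's trace id, or start a new trace at the edge.
        return incomingHeaderValue != null ? incomingHeaderValue : UUID.randomUUID().toString();
    }

    // Every downstream call carries the same trace id forward.
    static HttpRequest downstreamRequest(String traceId, String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header(TRACE_HEADER, traceId)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        String traceId = resolveTraceId(null); // no incoming header: a new trace starts here
        // Each service includes the trace id in every log line it writes.
        System.out.println("[traceId=" + traceId + "] order-service: processing request");
        HttpRequest toPaymentService = downstreamRequest(traceId, "http://payment-service/pay");
        System.out.println("[traceId=" + traceId + "] calling " + toPaymentService.uri());
        // An HttpClient would send toPaymentService here; the next service repeats the same pattern.
    }
}
```

In a real system this plumbing is handled by the tracing library's instrumentation rather than hand-written code; the sketch only shows the propagation principle.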

Log Aggregator

Now we have an identifier that helps us follow any given request across the services.

The next challenge is that it is not practical to log in to each server and analyze its logs. Hence, we need a way to collate the logs from all the different services and present them in one place for easy, efficient reading.

Log aggregation tools help with this. Every service, while tagging its logs with the trace id, prepares the logging data and sends it to one centralized log aggregator service (usually through one of the available tools or libraries).

The log aggregator stores the logs in an optimized, search-friendly, and scalable data store. Once all log data is available in one place, it can be used for many different purposes.
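As a hedged sketch of what a shipped log entry might look like (the field names and the JSON-lines format are assumptions for illustration, not any particular tool's schema), each service could emit structured lines like these and let a shipper forward them to the central store:

```java
import java.time.Instant;

public class LogEntry {

    // Illustrative fields; a real pipeline's schema is defined by the aggregator in use.
    static String toJsonLine(String traceId, String service, String level, String message) {
        return String.format(
            "{\"timestamp\":\"%s\",\"traceId\":\"%s\",\"service\":\"%s\",\"level\":\"%s\",\"message\":\"%s\"}",
            Instant.now(), traceId, service, level, message);
    }

    public static void main(String[] args) {
        // Each service writes lines like these; a shipper forwards them to the central store.
        System.out.println(toJsonLine("trace-4f2a", "order-service", "INFO", "order accepted"));
        System.out.println(toJsonLine("trace-4f2a", "payment-service", "INFO", "payment authorized"));
    }
}
```

Because every line carries the trace id, the aggregator can index on it and pull together everything one request touched.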

Log Visualizer

Log aggregators hold all log data in one place. These tools and databases are built to store bulk log entries and to enable fast searches over them.

Now, this data can be used to present the whole request trail in an easily understandable format, in a UI or through an API.

Once we have this aggregated data, it opens many opportunities, for example:

  • See request trails across services for troubleshooting (a minimal sketch of this follows the list)
  • Analyze performance data, i.e. the time taken by each service/operation
  • Analyze client usage patterns or user behavior, i.e. which service is used more and at what times
  • Connect this data with contextual information for even deeper analysis, for example, which services were used most on New Year's Eve. Such insights can feed into system design decisions.
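To make the first bullet concrete, here is a minimal sketch (the Entry type and the sample data are made up for illustration): once every service's entries sit in one store, reconstructing a request trail is essentially filtering by trace id and sorting by timestamp.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

public class RequestTrail {

    // Minimal stand-in for an aggregated log entry; a real store would expose a query API instead.
    record Entry(Instant timestamp, String traceId, String service, String message) {}

    // The whole trail of one request, across services, in the order it happened.
    static List<Entry> trailFor(String traceId, List<Entry> allEntries) {
        return allEntries.stream()
                .filter(e -> e.traceId().equals(traceId))
                .sorted(Comparator.comparing(Entry::timestamp))
                .toList();
    }

    public static void main(String[] args) {
        List<Entry> store = List.of(
                new Entry(Instant.parse("2021-02-27T10:00:00Z"), "t-1", "order-service", "order accepted"),
                new Entry(Instant.parse("2021-02-27T10:00:01Z"), "t-1", "payment-service", "payment authorized"),
                new Entry(Instant.parse("2021-02-27T10:00:02Z"), "t-2", "order-service", "another request"));

        trailFor("t-1", store).forEach(e ->
                System.out.println(e.timestamp() + " " + e.service() + ": " + e.message()));
    }
}
```

The visualizers of real tracing tools do the same correlation, plus timing breakdowns and dependency graphs, on top of their own stores.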

Performance Impact

An obvious question: doesn't logging data for each service call and operation, and then sending it to a centralized collector, add performance overhead? This is true to some extent; however, there are ways to optimize it.

Adding distributed trace tokens and logging is usually managed by instrumentation-driven libraries, which add pre/post hooks around instrumented methods to log the basic information. The amount of information collected is configurable.

As this logging is managed by libraries and mostly abstracted away from application code, these libraries can be (and generally are) highly optimized for the job.

Sending logging data to the centralized collector (aggregator) can be optimized by making it asynchronous. For example, a logging client on each node can collect all the logs in local in-memory (and on-disk) storage, and later send them to the centralized server asynchronously over HTTP, a queue, etc. This takes most of the reporting overhead out of the service's request-processing flow.
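A minimal sketch of that idea, with the queue capacity, batch size, and send mechanism as assumptions: the request path only enqueues a log line, while a background thread drains the buffer and ships it in batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class AsyncLogShipper {

    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);

    // Hot path: called from request handling, never blocks on the network.
    public void report(String logLine) {
        // If the buffer is full, drop (or spill to disk) rather than slow the service down.
        buffer.offer(logLine);
    }

    // Background thread: drains the buffer and ships batches to the central collector.
    public void startShipping() {
        Thread shipper = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String first = buffer.poll(1, TimeUnit.SECONDS);
                    if (first == null) continue;
                    batch.add(first);
                    buffer.drainTo(batch, 99);   // up to 100 lines per batch
                    sendToCollector(batch);      // one network call for the whole batch
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "log-shipper");
        shipper.setDaemon(true);
        shipper.start();
    }

    // Placeholder: a real shipper would POST to the aggregator or publish to a queue.
    private void sendToCollector(List<String> batch) {
        System.out.println("shipping " + batch.size() + " log lines");
    }
}
```

Real tracing clients follow this same buffer-and-batch pattern, usually with sampling and backpressure controls added on top.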

The cost of these benefits cannot be zero, but it can be contained in most cases. Most tracing libraries are highly configurable in log level and reporting mechanism, providing various options to optimize the process for application-specific use cases.

Tech to Support

There are many libraries and tools available to enable these patterns.

Zipkin & Jaeger are the two most popular Distributed Tracing systems available as of now. Zipkin was developed by Twitter, and Jaeger by Uber.

  • Both were open-sourced by their respective companies once they reached production readiness, and are now maintained with their open-source communities.
  • Both support the basic components described above: distributed tracing, a log aggregator/collector, and a visualizer. Additionally, each supports many more useful, tool-specific features on top.
  • Both are quite similar in architecture and usage; they differ mostly in deployment style and in the plugins available for different languages and frameworks.

Refer to their respective websites for detailed information; both are well-documented tools.

More Tools

  • OpenTracing is on a mission to standardize tracing APIs and tools. It is also an incubating project in the Cloud Native Computing Foundation (a minimal instrumentation sketch against this API follows this list).
  • Lightstep — Cloud-native monitoring and observability tool, with unified telemetry
  • Honeycomb — Observability tool for all kinds of distributed applications with deep pattern analysis capabilities
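To give a feel for what instrumentation looks like in code, below is a hedged sketch against the OpenTracing Java API (io.opentracing), which Jaeger's Java client implements and which Zipkin can be used with through a bridge. The operation name and tag are illustrative, and GlobalTracer falls back to a no-op tracer if none has been registered.

```java
import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class CheckoutHandler {

    public void handleCheckout(String orderId) {
        // Whichever tracer (Jaeger, a Zipkin bridge, ...) was registered at startup.
        Tracer tracer = GlobalTracer.get();

        // One span per logical operation; spans carry the trace id and timing data.
        Span span = tracer.buildSpan("checkout").start();
        try (Scope scope = tracer.activateSpan(span)) {
            span.setTag("order.id", orderId);   // extra context shows up in the visualizer
            // ... business logic and downstream calls go here ...
        } finally {
            span.finish();                      // duration is recorded and reported asynchronously
        }
    }
}
```

In practice, much of this is generated by framework integrations, so application code rarely needs to manage spans by hand.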

Conclusion

The combination of these tools and design patterns makes distributed troubleshooting and monitoring possible in the microservice ecosystem.

Keep in mind that it is still more complex than in monolithic designs. It still needs deep system knowledge across the services and the ability to connect all the dots. It needs experience with both system design and the system itself. However, a combination of these tools makes it feasible.

Hence, if you are not ready for this extra complexity, it is better to stay away from microservices-style distributed architecture patterns. The number of services in such designs can be overwhelming due to their fine-grained structure.

Or start with the right tooling in place to manage it well from the beginning. Never try to run a highly distributed, microservices-style system without these supporting tools. Otherwise, you will spend months of effort on troubleshooting, with a high risk of burning out teams physically and morally, and of losing productivity too.

We shall discuss more challenges, like distributed transactions and managing failures in distributed environments, in future articles.

Till then, Happy Learning and stay tuned…

If you enjoyed reading this, please share, give a clap, and follow for more stories.

If you have any suggestions, feel free to reach me on Linkedin: Mohit Gupta
