Systems and applications alike have become increasingly distributed as microservices, open-source tools, and containerisation have gained traction. To actively monitor such environments and respond quickly to issues as they arise, distributed tracing has proven vital for businesses such as Uber, Postmates, Hello Fresh and TransferWise.
It is, however, important to clarify what distributed tracing actually means. In this article, we'll examine how you can gain observability into a highly distributed architecture and learn more about why tracing matters.
What Are Traces?
The most practical way to understand traces is to think of them as contextualised logs. A trace offers us a way to follow the progress of a transaction or workflow as it moves through a system.
Trace spans let us examine the performance of a single request, which helps diagnose the root cause of performance issues. A service map, on the other hand, offers an end-to-end view, making it easier to isolate issues to individual services. Trace groups, in turn, can be used to monitor performance and identify issues early.
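To make the span concept concrete, here is a minimal sketch in plain Python (not any particular tracing library; the class and field names are illustrative) of how the spans of a single request link together into a trace through a shared trace ID and parent span IDs:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed operation within a request; spans sharing a trace_id form a trace."""
    name: str
    trace_id: str                       # shared by every span in the same request
    parent_id: Optional[str] = None     # links this span to its caller's span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self):
        self.end = time.time()

    @property
    def duration_ms(self):
        return (self.end - self.start) * 1000

# One request produces a trace of three spans across two hypothetical services.
trace_id = uuid.uuid4().hex
root = Span("checkout", trace_id)                      # entry point
db = Span("query-inventory", trace_id, root.span_id)   # child call
db.finish()
pay = Span("charge-card", trace_id, root.span_id)      # sibling call
pay.finish()
root.finish()
```

Because every span carries the same `trace_id`, a tracing backend can reassemble the full journey of the request, and the parent IDs preserve which operation called which.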
A number of tools collect trace data, including the Amazon Kinesis agent, the CloudWatch agent, Beats, Fluent Bit, and Fluentd.
It is also possible to aggregate traces using a number of different tools, including Amazon Kinesis Data Firehose, Amazon Managed Streaming for Apache Kafka, Amazon Simple Storage Service, and Logstash.
The best-known producers of application trace data are Jaeger, Zipkin, and OpenTelemetry. In the collector part of the pipeline, OpenTelemetry performs the same function that Fluentd does for logs; in an OpenTelemetry-based pipeline, the equivalent aggregator is the OpenSearch Data Prepper.
What Is Distributed Tracing?
Distributed tracing has become an increasingly common way to troubleshoot performance issues and errors, because it makes transaction-level monitoring possible.
The feeling of having to find a "needle in a haystack" is often the most difficult aspect of troubleshooting applications and services for engineers. In this situation, distributed tracing can provide a method of observing requests as they propagate through distributed applications and services.
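One concrete mechanism behind "observing requests as they propagate" is the W3C Trace Context standard, which defines a `traceparent` HTTP header that each service forwards downstream. The header format below follows that standard, but the helper functions themselves are an illustrative sketch in plain Python rather than a real tracing library:

```python
import secrets

def make_traceparent(trace_id=None, parent_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-parentid-flags."""
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex chars, shared across services
    parent_id = parent_id or secrets.token_hex(8)  # 16 hex chars, the calling span
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"

def parse_traceparent(header):
    _version, trace_id, parent_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": flags == "01"}

# Service A starts a trace and calls Service B with the header attached.
header = make_traceparent()
ctx = parse_traceparent(header)

# Service B keeps the same trace_id but forwards a new header whose parent
# is B's own span, preserving causality end to end.
downstream = make_traceparent(trace_id=ctx["trace_id"])
```

Because the trace ID survives every hop, the tracing backend can stitch each service's spans back into one request, which is exactly what turns the haystack search into a lookup.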
To understand how distributed tracing assists in performance troubleshooting, it helps to first consider the types of performance problems that are commonly faced. These are frequently latency and compute-utilisation issues in your services: problems that can easily harm both user experience and your bottom line.
Issues like these are often difficult to diagnose because there is no clear error or event to alert on. Instead, the problem may have crept up in severity gradually, or be buried somewhere deep in your system. In large distributed systems, performance problems can be unexpected and causality hard to determine; in many cases, the problem goes undetected or is hidden under the volume of other telemetry signals. This is where distributed tracing helps: because it observes the performance of each request over time, it can surface otherwise hidden problems affecting your systems.
Why Would You Need To Use Distributed Tracing?
In a complex, distributed system, it is very difficult for engineers to determine intuitively which service in the architecture has the most impact on user latency. Without tracing and the trace data it produces, you cannot answer that question at all, let alone act on it.
Traces can be used to determine the latencies of the different services a request touches. If you are trying to speed up a distributed system, distributed tracing is a valuable tool for determining where to focus your attention.
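Once spans are collected, finding where to focus is largely an aggregation exercise. The sketch below uses plain Python and hypothetical span data (the service names and durations are invented for illustration) to total time spent per service and rank where the latency actually comes from:

```python
from collections import defaultdict

# Hypothetical spans from one traced request: (service, operation, duration in ms)
spans = [
    ("gateway",  "handle-request",  12.0),
    ("orders",   "create-order",    35.0),
    ("payments", "authorise-card", 180.0),
    ("orders",   "update-status",    8.0),
]

# Sum the time each service contributed to the request.
latency_by_service = defaultdict(float)
for service, _operation, duration_ms in spans:
    latency_by_service[service] += duration_ms

# Rank services by total time to decide where optimisation effort pays off.
ranked = sorted(latency_by_service.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # → ('payments', 180.0)
```

In this invented example the payments service dominates the request, so that is where an engineer would look first, even though no single service reported an error.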
Consider also a scenario where a number of microservices cooperate to accomplish a single objective within an application. This type of architecture often becomes very challenging to troubleshoot, because a single request may pass through many microservices simultaneously.
Using distributed tracing, you can pinpoint exactly how each request has moved across this system, which greatly simplifies troubleshooting and identifying issues within the architecture.
By creating a trace monitoring dashboard, we can have a central reference point. This will enable us to refer back whenever we want to follow the progress of traces in the system.
A feature like Example Traces can be an effective way of finding the reason behind latency: it overlays sampled traces onto a latency heat map that can be viewed within the monitoring dashboard.
As well as grouping similar traces together, Trace Groups can be used to monitor the performance of each trace, helping us identify problems before they become serious issues.
Tracing is a component of monitoring, but it is not an entire monitoring solution on its own. Gaining the full value from trace data requires the right context and the right traces, and not every problem in your system will surface in a trace. This is why we need a full observability platform that handles logs and metrics alongside tracing.
Observability aims to navigate and discover the unknown unknowns, while monitoring is more focused on failures that can be anticipated and defined in advance.
A fully integrated approach includes health-check endpoints that tell monitoring tools how the applications are performing; application and system metrics that can alert people to anomalies; timestamped logs that provide the context needed to determine the root cause of problems; and context-rich tracing that helps correlate and debug distributed call traces to pinpoint the exact source of errors and latencies.
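As an illustration of the first of those pieces, a health-check endpoint can be as simple as an HTTP route that reports whether the service is up, which monitoring tools then poll on a schedule. Here is a minimal sketch using only the Python standard library; the `/health` path and the response shape are common conventions rather than a standard:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Report service status; real checks would also probe dependencies.
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the example quiet
        pass

# Bind to port 0 so the OS picks a free port, then poll the endpoint once,
# the way a monitoring tool would.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/health") as resp:
    payload = json.load(resp)
server.shutdown()
print(payload)  # → {'status': 'ok'}
```

A real health check would typically also verify dependencies (database connectivity, queue reachability) so the endpoint reflects whether the service can actually do useful work, not merely that the process is alive.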
Examples Of Distributed Tracing Tools
The Cloud Trace tool in the Google Operations Suite is a distributed tracing solution that is particularly handy for analysing latency issues, and it is also useful in scenarios where a downstream service is affecting upstream behaviour.
GCP Cloud Profiler, a continuous profiling tool for the Google Cloud Platform, can show you exactly which parts of your code are causing an issue. In production, Cloud Profiler can be used to investigate which functions in your service's code are slowing it down. By default, Cloud Profiler displays your service profile as a flame graph, a view that makes it easy to see where the service spends most of its time or lags.
The Logit.io platform also provides distributed tracing for users that wish to explore how they can quickly discover issues and resolve them faster. Feel free to explore this use case further with a free 14-day platform trial.