


In the context of application performance monitoring (APM) and observability, traces and spans are fundamental concepts that help users to track and understand the flow of requests and operations within a system. They are essential in assisting users to identify bottlenecks, troubleshoot issues, and optimize application performance.

In our latest guide covering everything you need to know about application performance monitoring and observability, we are exploring in detail the key differences between traces & spans and how they aid users in achieving observability.

Traces vs Spans

Understanding how your software behaves, and detecting and mitigating performance issues, has always been a fundamental part of application performance management and monitoring (commonly known as APM). In the context of tracing, this means capturing and recording the flow of transactions and interactions within an application. Doing so helps identify performance bottlenecks, errors, and other issues that may affect the application's performance and user experience.

In this article, we discuss what traces and spans are, how they differ, and where distributed tracing fits in, all within the context of APM, so that you can act quickly when issues arise within your application.

What are traces?

A trace represents the whole journey of a request or action as it moves through the nodes of a distributed system, such as a containerized application or a microservices architecture. As part of APM, traces help developers and operations engineers understand the end-to-end journey of a request, from its initial entry point, through every service it touches, to its completion.

Tracing forms part of a monitoring solution. It’s not the monitoring solution in its entirety, because you have to identify the correct traces with the right context to derive value from them.

What are spans?

A span is an operation or unit of ‘work’ taking place on a service, for example a web server responding to an HTTP request or a single invocation of a function. Each span has a start time and an end time. In distributed tracing, a series of these tagged time intervals forms a single trace.

Spans have a parent-child relationship and fan out in a tree-like structure: in this analogy, the tree is the trace and the branches are spans. The top-level parent span, known as the root span, has no parent of its own and encapsulates the end-to-end latency of the entire request. A child span is triggered by a parent span and can be a function call, a database call, a call to another service, and so on. Combining all the spans in a trace gives you a detailed picture of how the request performed across its entire lifecycle.
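
The parent-child structure can be illustrated with a minimal Python sketch. The span records, service names, and durations below are invented for illustration and do not come from any particular tracing library; the sketch only shows how a flat list of spans might be reassembled into a tree using parent IDs.

```python
from collections import defaultdict

# Hypothetical flat span records: (span_id, parent_id, name, duration_ms).
# A parent_id of None marks the root span.
spans = [
    ("a", None, "GET /checkout", 120),
    ("b", "a", "auth.verify_token", 15),
    ("c", "a", "orders.create", 80),
    ("d", "c", "db.insert_order", 45),
]

def build_tree(spans):
    """Index spans by parent ID and return (root span, children map)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        _span_id, parent_id, _name, _duration = span
        if parent_id is None:
            root = span
        else:
            children[parent_id].append(span)
    return root, children

def render(span, children, depth=0):
    """Return the span and its descendants as indented text lines."""
    span_id, _parent_id, name, duration = span
    lines = [f"{'  ' * depth}{name} ({duration} ms)"]
    for child in children[span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines

root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

The root span's duration (120 ms here) covers the whole request, while each indented child accounts for one part of it.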

Distributed Tracing

A trace is a group of transactions and spans with a common root. Each trace tracks the entirety of a single request. When a trace travels through multiple services, as is common in a microservice architecture, it is known as a distributed trace.
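
As a rough illustration of this definition, the Python sketch below groups spans by a shared trace ID and labels a trace as distributed when its spans come from more than one service. The record layout and service names are assumptions made for the example, not a real tracing backend's schema.

```python
from collections import defaultdict

# Hypothetical span records: (trace_id, span_id, service).
spans = [
    ("t1", "s1", "frontend"),
    ("t1", "s2", "checkout"),
    ("t1", "s3", "payments"),
    ("t2", "s4", "frontend"),
    ("t2", "s5", "frontend"),
]

def services_by_trace(spans):
    """Map each trace ID to the set of services its spans ran on."""
    services = defaultdict(set)
    for trace_id, _span_id, service in spans:
        services[trace_id].add(service)
    return dict(services)

def is_distributed(services):
    """A trace that touches more than one service is a distributed trace."""
    return len(services) > 1

grouped = services_by_trace(spans)
for trace_id, services in sorted(grouped.items()):
    kind = "distributed" if is_distributed(services) else "single-service"
    print(trace_id, sorted(services), kind)
```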

Distributed tracing helps teams identify the root cause of application performance issues faster, which supports timely troubleshooting and improves MTTR (mean time to resolution). It also helps identify performance bottlenecks, latency issues, and errors that occur across the distributed system. This information is invaluable for optimizing the application's performance and maintaining uptime.

Because distributed tracing highlights the exact areas where issues lie, it also helps boost efficiency across teams and improves the working relationships that are crucial for both timely troubleshooting and delivering innovations that grow the business. As a result, organizations can gain a competitive advantage by getting new products and services to market faster.

Traces vs Spans

Traces and spans are two fundamental concepts in distributed tracing. To delve deeper, we will break down the two terms in the context of APM. First, as mentioned previously, a trace consists of multiple spans that together form the complete journey of a transaction as it moves through various components and services. Traces are used to understand the overall flow of a request and the interactions between different services and components. However, to derive value from traces, context is essential.

When analyzing traces, there can also be large volumes of data to trawl through, especially when a trace involves interactions with multiple services or components. If the majority of requests are successful 200s that finish without unacceptable latency or errors, you don’t really need all of that data. You don’t need a ton of data to find the right insights; you need the right sample of it. It is therefore paramount to employ a sampling technique that discards the uninformative traces and sends only the traces you can and should act on to the tracing backend. Sampling saves time and financial resources, filters out noise, and lets you focus on traces of interest. For example, your frontend team may only want to see traces with specific user attributes.

Shifting the focus to spans, we know that a span represents a single operation within a trace and that spans have a parent-child relationship. A span typically contains the following information:
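
One way to picture such a sampling rule is the Python sketch below. It assumes the decision is made once a trace has finished (often called tail-based sampling), so its status and duration are already known. The threshold, the baseline rate, and the trace fields are illustrative assumptions rather than settings from any specific product.

```python
import random

LATENCY_THRESHOLD_MS = 500   # assumed cutoff for "unacceptable latency"
BASELINE_RATE = 0.05         # assumed share of healthy traces to keep

def keep_trace(trace, rng):
    """Decide whether a finished trace should be sent to the tracing backend."""
    if trace["status"] >= 400:                        # failed requests are always kept
        return True
    if trace["duration_ms"] > LATENCY_THRESHOLD_MS:   # slow requests are always kept
        return True
    return rng.random() < BASELINE_RATE               # keep a small sample of the rest

rng = random.Random(42)  # seeded only so the example is repeatable
traces = [
    {"id": "t1", "status": 200, "duration_ms": 40},
    {"id": "t2", "status": 500, "duration_ms": 35},
    {"id": "t3", "status": 200, "duration_ms": 900},
]
kept = [t["id"] for t in traces if keep_trace(t, rng)]
print(kept)
```

The failed trace (t2) and the slow trace (t3) are always kept, while a fast, successful trace such as t1 survives only if it falls within the baseline sample.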

  • Operation Name: A descriptive name of the operation or event represented by the span. For example, a span may be labeled as ‘Database Query’.
  • Start and End Timestamps: Timestamps recording when the span's operation started and when it completed. By measuring the time between these timestamps, APM tools can calculate the duration of the operation, helping to identify potential performance hotspots.
  • Metadata: Additional context and metadata associated with the span, such as error status, log messages, tags, and key-value pairs providing extra details about the operation.
  • Parent Span ID: A reference to the ID of the span that initiated or triggered the current span. This parent-child relationship helps APM systems reconstruct the complete transaction flow and visualize the dependencies between different spans.
  • Unique Span ID: An identifier that distinguishes the span from every other span within the same trace.
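
The fields listed above might be modeled as a small Python data class, as in the sketch below. The field names, ID strings, timestamps, and metadata keys are assumptions chosen for illustration and are not taken from a specific APM tool's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    span_id: str                    # unique within the trace
    parent_span_id: Optional[str]   # None when the span is the root
    operation_name: str
    start_ns: int                   # start timestamp in nanoseconds
    end_ns: int                     # end timestamp in nanoseconds
    metadata: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        """Duration derived from the start and end timestamps."""
        return (self.end_ns - self.start_ns) / 1_000_000

# Hypothetical example values.
query = Span(
    span_id="b7ad6b7169203331",
    parent_span_id="00f067aa0ba902b7",
    operation_name="Database Query",
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_000_042_000_000,
    metadata={"db.system": "postgresql", "error": False},
)
print(f"{query.operation_name}: {query.duration_ms} ms")
```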

Spans are essential for analyzing the latency, or response time, of individual operations within a transaction. Each span records its start and end timestamps, allowing APM systems to calculate the latency of that particular operation.

Analyzing span latency provides a granular view of how much time each operation takes within a transaction. This granular performance analysis allows developers and operations teams to pinpoint which specific parts of the transaction are causing delays or contributing to overall performance issues. Application performance monitoring tools (such as the one provided by Logit.io) often provide latency histograms that show the distribution of latencies across spans. These histograms help visualize the range of response times and identify spans with unusually high latencies. Lastly, alerts can be configured to fire when latency exceeds a specific threshold, so that your team can act promptly when these issues arise.
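
To make the idea of a latency histogram and a threshold alert concrete, here is a minimal Python sketch. The bucket boundaries, the alert threshold, and the span names are assumptions for the example; an APM tool would normally compute and render this for you.

```python
from collections import Counter

BUCKET_BOUNDS_MS = [50, 100, 250, 500]   # assumed bucket upper bounds
ALERT_THRESHOLD_MS = 500                 # assumed alerting threshold

def bucket_label(latency_ms):
    """Return the label of the first bucket that contains the latency."""
    for bound in BUCKET_BOUNDS_MS:
        if latency_ms <= bound:
            return f"<= {bound} ms"
    return f"> {BUCKET_BOUNDS_MS[-1]} ms"

def histogram(latencies_ms):
    """Count how many latencies fall into each bucket."""
    return Counter(bucket_label(value) for value in latencies_ms)

def breaches(spans, threshold_ms=ALERT_THRESHOLD_MS):
    """Return the names of spans whose latency exceeds the threshold."""
    return [name for name, latency in spans if latency > threshold_ms]

# Hypothetical (span name, latency in ms) pairs.
spans = [
    ("auth.verify_token", 12),
    ("orders.create", 95),
    ("db.insert_order", 240),
    ("payments.charge", 730),
]
hist = histogram(latency for _name, latency in spans)
for label, count in hist.items():
    print(f"{label:>10}  {'#' * count}")
print("alert:", breaches(spans))
```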

If you found this article on observability informative, why not read all about the OpenTelemetry collector or what is observability?


© 2024 Logit.io Ltd, All rights reserved.