Jaeger is an open-source distributed tracing platform originally created by Uber, now a Cloud Native Computing Foundation project. It is designed to monitor and troubleshoot complex, latency-sensitive transactions in microservices architectures. Jaeger was originally built around the OpenTracing API, which has since been superseded by OpenTelemetry, and provides capabilities for distributed context propagation, transaction monitoring, and root cause analysis. Its core function is to collect, store, and visualize detailed end-to-end traces that show the path and timing of requests as they traverse numerous services.

What is Jaeger?
Jaeger is an open-source, end-to-end distributed tracing system used for monitoring and troubleshooting microservices-based architectures.
The system's architecture includes instrumentation libraries, a collector, a query service, and a web-based UI for exploring traces via flame graphs and dependency graphs. Jaeger supports multiple trace sampling strategies and storage backends like Cassandra and Elasticsearch. As a key tool in the observability stack, it enables engineers to understand system dependencies, optimize performance, and diagnose failures by providing a holistic view of request flows across service boundaries.
Key Features of Jaeger
Jaeger's core features are designed for high scalability, deep visibility, and seamless integration with modern cloud-native ecosystems.
Adaptive Sampling Strategies
To manage the immense volume of trace data in production, Jaeger clients and collectors implement sophisticated trace sampling. This includes:
- Probabilistic (Head) Sampling: A fixed percentage of traces are sampled at the request's start.
- Rate-Limiting Sampling: Controls the number of traces per second.
- Remote-Sampled Configuration: Sampling strategies can be dynamically configured and fetched from a central Jaeger backend, allowing policies to be updated without redeploying services.
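Together, these strategies keep trace volume and storage costs under control. Because they are head-based, the sampling decision is made before a request's outcome is known. Reliably retaining specific high-value traces (e.g., those with errors or high latency) requires tail-based sampling, which is typically performed in an OpenTelemetry Collector placed in front of Jaeger.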
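The sampling strategies above can be sketched in a few lines of Python. This is an illustrative sketch, not Jaeger's actual client code: the class names, the 63-bit comparison, the token-bucket details, and the `checkout` service are assumptions, and the strategies dictionary is only loosely modeled on the shape of a remote sampling configuration.

```python
import time

ID_SPACE = 1 << 63  # assumption: decisions use the low 63 bits of the trace ID


class ProbabilisticSampler:
    """Head sampling: keep roughly `rate` of all traces, decided from the trace ID."""

    def __init__(self, rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be within [0, 1]")
        self.boundary = int(rate * ID_SPACE)

    def is_sampled(self, trace_id):
        # Deterministic per trace ID, so the same ID always gets the same answer.
        return (trace_id & (ID_SPACE - 1)) < self.boundary


class RateLimitingSampler:
    """Token bucket that admits at most `traces_per_second` new traces on average."""

    def __init__(self, traces_per_second, clock=time.monotonic):
        self.rate = traces_per_second
        self.capacity = max(traces_per_second, 1.0)
        self.balance = self.capacity
        self.clock = clock
        self.last = clock()

    def is_sampled(self):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self.balance = min(self.capacity, self.balance + (now - self.last) * self.rate)
        self.last = now
        if self.balance >= 1.0:
            self.balance -= 1.0
            return True
        return False


def strategy_for(service, strategies):
    """Pick the per-service strategy, falling back to the default."""
    for entry in strategies.get("service_strategies", []):
        if entry["service"] == service:
            return entry
    return strategies["default_strategy"]


def build_sampler(strategy, clock=time.monotonic):
    if strategy["type"] == "probabilistic":
        return ProbabilisticSampler(strategy["param"])
    if strategy["type"] == "ratelimiting":
        return RateLimitingSampler(strategy["param"], clock=clock)
    raise ValueError(f"unknown strategy type: {strategy['type']}")


# Hypothetical configuration, as a service might fetch it from a central backend.
STRATEGIES = {
    "default_strategy": {"type": "probabilistic", "param": 0.1},
    "service_strategies": [
        {"service": "checkout", "type": "ratelimiting", "param": 2},
    ],
}
```

Because the configuration is plain data, a service can re-fetch it periodically and rebuild its sampler without a redeploy, which is the point of remote sampling.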
High-Performance Storage Backends
Jaeger's architecture separates the query service from pluggable storage layers, enabling it to scale with different database technologies optimized for trace data.
- Cassandra: A long-supported, battle-tested backend for scalable, write-heavy workloads.
- Elasticsearch: Popular for its powerful full-text search and aggregation capabilities on trace attributes.
- gRPC Plugin: Supports custom storage implementations via a well-defined gRPC interface.
This design allows deployment flexibility, from self-managed Cassandra clusters to managed Elasticsearch services in the cloud.
Service Dependency Graphing
Jaeger automatically generates service graphs by analyzing trace data. These visual maps show:
- Service Nodes: All services involved in request flows.
- Directed Edges: The calls between services, annotated with call counts.
- Updates: With in-memory storage, graphs reflect new trace data as it is ingested; with production storage backends, dependencies are typically computed by a periodic batch job.
This feature is critical for understanding system topology, identifying unexpected dependencies, and visualizing the impact of failures across a microservice mesh.
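As a rough illustration of how such a graph can be derived from trace data, the sketch below counts parent-to-child calls that cross a service boundary. The `Span` fields and the service names are hypothetical, and Jaeger's own dependency computation is more involved than this.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


def dependency_edges(spans):
    """Count parent -> child calls where the two spans belong to different services."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get((span.trace_id, span.parent_id))
        if parent is not None and parent.service != span.service:
            edges[(parent.service, span.service)] += 1
    return edges


spans = [
    Span("t1", "a", None, "frontend"),
    Span("t1", "b", "a", "checkout"),
    Span("t1", "c", "b", "payments"),
    Span("t1", "d", "b", "checkout"),  # same service as its parent: no edge
    Span("t2", "a", None, "frontend"),
    Span("t2", "b", "a", "checkout"),
]

edges = dependency_edges(spans)
for (caller, callee), calls in sorted(edges.items()):
    print(f"{caller} -> {callee}: {calls} call(s)")
```

Nothing in the graph is configured by hand: nodes and edges appear only because spans with those parent-child relationships were observed.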
Comparative Trace Analysis
A powerful feature for performance debugging, Jaeger's UI allows for side-by-side comparison of two traces. Engineers can:
- Select Traces: Choose two traces by Trace ID.
- Visual Diff: View a unified timeline highlighting differences in span durations and structure.
- Root-Cause Investigation: Quickly pinpoint which service or operation caused a regression in latency between two executions of the same logical request, such as before and after a deployment.
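The idea behind a trace comparison can be sketched outside the UI as well. The example below is a simplified assumption of how a diff might work, not Jaeger's implementation: each trace is reduced to per-operation totals, and operations are ranked by how much they changed. The tuples carry a hypothetical self time (time spent in the span excluding its children) so that a regression is attributed to the operation that caused it rather than to every caller above it.

```python
from collections import defaultdict


def total_by_operation(spans):
    """Sum self time per (service, operation) pair."""
    totals = defaultdict(int)
    for service, operation, self_time_us in spans:
        totals[(service, operation)] += self_time_us
    return totals


def diff_traces(before, after):
    """Return (operation, before_us, after_us, delta_us) rows, largest change first."""
    a, b = total_by_operation(before), total_by_operation(after)
    rows = []
    for key in a.keys() | b.keys():
        delta = b.get(key, 0) - a.get(key, 0)
        if delta != 0:
            rows.append((key, a.get(key, 0), b.get(key, 0), delta))
    return sorted(rows, key=lambda row: (-abs(row[3]), row[0]))


# Hypothetical traces of the same request before and after a deployment.
before = [
    ("frontend", "GET /cart", 40_000),
    ("cart", "load", 50_000),
    ("db", "SELECT", 30_000),
]
after = [
    ("frontend", "GET /cart", 42_000),
    ("cart", "load", 51_000),
    ("db", "SELECT", 30_000),
    ("db", "SELECT", 190_000),  # an extra, slow query introduced by the deployment
]

rows = diff_traces(before, after)
for (service, operation), old, new, delta in rows:
    print(f"{service}:{operation} {old} -> {new} us ({delta:+d})")
```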
Cloud-Native Deployment Models
Jaeger is designed for modern infrastructure, offering multiple deployment patterns to suit different operational scales.
- All-in-One: A single binary with UI, collector, query, and embedded storage for local development and testing.
- Production Modular: Components (Agent, Collector, Query, UI) are deployed as separate, scalable services. The Agent is deprecated in recent releases in favor of sending spans directly to the Collector.
- Kubernetes-Native: Official Helm charts and operator (Jaeger Operator) manage the lifecycle, configuration, and provisioning of Jaeger instances on Kubernetes, supporting features like sidecar agent injection.
This flexibility supports evolution from a simple pilot to a large-scale, enterprise-grade tracing platform.
How Jaeger Works: Architecture & Data Flow
Jaeger's architecture is designed to ingest, process, store, and visualize trace data at scale.
Jaeger's architecture follows a modular design with several core components. The Jaeger Client (or, today, an OpenTelemetry SDK) instruments applications to generate spans. These spans are sent via the OpenTelemetry Protocol (OTLP) or one of Jaeger's older native protocols (Thrift or gRPC) to the Jaeger Collector, which validates, processes, and writes them to storage. The Jaeger Query service retrieves traces from storage for the Jaeger UI, which provides interactive visualizations like flame graphs and service dependency graphs. Storage backends include Cassandra and Elasticsearch; Kafka can sit between the collector and storage as a buffer.
Data flow begins with instrumentation generating spans. The collector can perform trace sampling and enrichment before persistence. For querying, the Jaeger Query service fetches and reassembles traces from storage, enabling latency analysis and root cause investigation. The system supports distributed context propagation via standards like W3C Trace Context to maintain trace continuity across service boundaries, providing a complete end-to-end tracing view of request execution paths.
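The data flow described above can be mimicked with a toy in-memory pipeline. This is a sketch under stated assumptions, not Jaeger code: `MemoryStorage`, `Collector`, and `Query` are hypothetical stand-ins for the real components, and the validation rules are deliberately minimal.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_us: int
    duration_us: int


class MemoryStorage:
    """Stand-in for a storage backend: spans are grouped by trace ID."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def write(self, span):
        self._by_trace[span.trace_id].append(span)

    def read(self, trace_id):
        return list(self._by_trace.get(trace_id, []))


class Collector:
    """Validates incoming spans and persists the valid ones."""

    def __init__(self, storage):
        self.storage = storage
        self.rejected = 0

    def receive(self, span):
        if not span.trace_id or not span.span_id or span.duration_us < 0:
            self.rejected += 1
            return False
        self.storage.write(span)
        return True


class Query:
    """Reassembles a trace from storage, ordered by start time."""

    def __init__(self, storage):
        self.storage = storage

    def get_trace(self, trace_id):
        return sorted(self.storage.read(trace_id), key=lambda s: s.start_us)


storage = MemoryStorage()
collector = Collector(storage)
query = Query(storage)

# Spans may arrive out of order and from different services.
collector.receive(Span("t1", "b", "a", "cart", "load", 10, 80))
collector.receive(Span("t1", "a", None, "frontend", "GET /cart", 0, 100))
collector.receive(Span("", "x", None, "broken", "noop", 0, 1))  # rejected: no trace ID

trace = query.get_trace("t1")
print([f"{s.service}:{s.operation}" for s in trace])
```

The separation matters in practice: the write path (collector to storage) and the read path (query from storage) only share the storage layer, so each side can be scaled independently.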
Jaeger vs. Other Distributed Tracing Tools
A technical comparison of Jaeger against other prominent distributed tracing backends and frameworks, focusing on architecture, protocol support, and operational characteristics relevant to agentic observability.
| Feature / Metric | Jaeger | Zipkin | Commercial APM (e.g., Datadog, New Relic) |
|---|---|---|---|
| Primary Architecture | All-in-one binary or separately deployed components | Monolithic (single server) | SaaS / Managed Service |
| Native Protocol Support | OTLP, Jaeger Thrift, Jaeger gRPC | HTTP JSON, Thrift | Vendor-specific agents & OTLP |
| OpenTelemetry (OTel) Integration | Native via OTLP & Jaeger exporters | Via OpenTelemetry Collector adapters | Native OTLP ingestion, often with extensions |
| Trace Storage Backend Options | Cassandra, Elasticsearch (optionally buffered through Kafka) | Cassandra, Elasticsearch, MySQL | Proprietary cloud storage |
| Tail Sampling Support | Not via remote sampling config, which is head-based; typically via tail sampling in an OpenTelemetry Collector | Limited (primarily head sampling) | Yes, advanced server-side sampling |
| Service Dependency Graphing | Yes, built-in (Spark job) | Yes, built-in | Yes, real-time and historical |
| Agentic Observability Features (e.g., LLM tool call tracing) | Requires custom instrumentation & span attributes | Requires custom instrumentation & span attributes | Pre-built integrations for AI/ML frameworks |
| Deployment & Operational Overhead | Self-managed, moderate to high | Self-managed, low to moderate | Managed, low (vendor responsibility) |
| Cost Model for High-Volume Traces | Infrastructure cost only | Infrastructure cost only | Usage-based pricing (per span/GB) |
| Trace Query Language | No dedicated query language; UI and API filters (service, operation, tags, duration) | Zipkin API (JSON query parameters) | Vendor-specific query DSL & UI |
Common Use Cases for Jaeger
Jaeger's primary function is to provide end-to-end visibility into request flows across microservices. Its open-source, vendor-neutral design makes it a foundational tool for several critical observability and operational tasks.
Frequently Asked Questions
Jaeger is a critical tool for monitoring modern, distributed software architectures. These questions address its core purpose, operation, and how it fits into the broader observability landscape.
Jaeger is an open-source, end-to-end distributed tracing system used for monitoring and troubleshooting transactions in complex, microservices-based architectures. It works by collecting, storing, and visualizing traces—detailed records of a request's journey across service boundaries. Developers instrument their applications (often using the OpenTelemetry SDK) to generate spans (timed operations). These spans, linked by a shared Trace ID, are sent to the Jaeger backend. Jaeger then provides a UI for querying traces, analyzing latency with flame graphs, and understanding service dependencies via service graphs.
Related Terms
Jaeger operates within a broader ecosystem of standards, tools, and concepts for distributed tracing. Understanding these related terms is essential for implementing effective observability.
Span
A span is the fundamental building block of a trace in Jaeger. It is a single, named, timed operation that represents a contiguous segment of work within a service. Spans are nested and ordered to model the flow of a request. Key characteristics:
- Timing: Contains a start timestamp and a duration.
- Context: Holds a span ID and a trace ID for correlation.
- Metadata: Includes span kind (e.g., client, server), attributes (key-value pairs like http.method=GET), status codes, and optional events or logs.
- Relationships: Spans have parent-child relationships, forming a hierarchy. A root span initiates a trace, and child spans represent downstream operations.
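The characteristics listed above map naturally onto a small data structure. The sketch below is a hypothetical model for illustration only; the field names and the `start_span` helper are assumptions and do not mirror the Jaeger or OpenTelemetry data models exactly.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str                      # shared by every span in the trace
    span_id: str                       # unique to this span
    parent_id: Optional[str] = None    # None marks the root span
    kind: str = "internal"             # e.g. "client" or "server"
    attributes: dict = field(default_factory=dict)
    start_ns: int = 0
    duration_ns: Optional[int] = None  # set when the span finishes

    def finish(self, end_ns):
        self.duration_ns = end_ns - self.start_ns


def start_span(name, parent=None, kind="internal", attributes=None, now_ns=time.time_ns):
    """Start a root span, or a child span that inherits the parent's trace ID."""
    return Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        kind=kind,
        attributes=dict(attributes or {}),
        start_ns=now_ns(),
    )


root = start_span("GET /cart", kind="server", attributes={"http.method": "GET"})
child = start_span("SELECT cart_items", parent=root, kind="client")
child.finish(child.start_ns + 2_000_000)
root.finish(root.start_ns + 5_000_000)
```

The child carries the root's trace ID and points at the root's span ID, which is all a backend needs to rebuild the hierarchy later.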
Trace
A trace is a directed acyclic graph (DAG) of spans that represents the end-to-end journey of a single request as it traverses a distributed system. In Jaeger, a trace is the complete story assembled from all correlated spans. Essential aspects:
- Global Identifier: Unified by a globally unique Trace ID propagated across all services.
- Visualization: Jaeger's UI visualizes traces as timeline views or flame graphs, where the width of a bar represents duration, showing critical paths and bottlenecks.
- Purpose: Traces answer questions about request latency, error propagation, and service dependencies, moving debugging from single-service to system-wide analysis.
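To make the timeline idea concrete, here is a minimal text rendering in which bar width is proportional to span duration and indentation follows the parent-child hierarchy. It is a sketch with hypothetical span dictionaries, not a description of how the Jaeger UI draws traces.

```python
def depth_of(span_id, parents):
    """Number of ancestors above a span; assumes the parent links are acyclic."""
    depth = 0
    while parents.get(span_id) is not None:
        span_id = parents[span_id]
        depth += 1
    return depth


def render_timeline(spans, width=40):
    """Return one text line per span, with bars scaled to the trace duration."""
    parents = {s["id"]: s["parent"] for s in spans}
    t0 = min(s["start_us"] for s in spans)
    t1 = max(s["start_us"] + s["duration_us"] for s in spans)
    total = max(t1 - t0, 1)
    lines = []
    for s in sorted(spans, key=lambda s: s["start_us"]):
        label = "  " * depth_of(s["id"], parents) + s["name"]
        offset = (s["start_us"] - t0) * width // total
        bar = max(1, s["duration_us"] * width // total)
        lines.append(f"{label:<24}|{' ' * offset}{'#' * bar}")
    return lines


spans = [
    {"id": "a", "parent": None, "name": "frontend: GET /cart", "start_us": 0, "duration_us": 100},
    {"id": "b", "parent": "a", "name": "cart: load", "start_us": 10, "duration_us": 80},
    {"id": "c", "parent": "b", "name": "db: SELECT", "start_us": 20, "duration_us": 60},
]

lines = render_timeline(spans, width=20)
print("\n".join(lines))
```

Even in this crude form, the widest bar at the deepest level points at where most of the request's time is spent.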
Distributed Context Propagation
This is the mechanism that enables tracing across service boundaries. For Jaeger to construct a complete trace, the trace context (Trace ID, Span ID, sampling flags) must be propagated from one service to the next. Key mechanisms:
- Inject/Extract: The tracing library injects context into outbound requests (e.g., as HTTP headers) and extracts it from inbound requests.
- Standards: Common formats include W3C Trace Context (modern standard) and B3 Propagation (originating from Zipkin). Jaeger clients support multiple propagators.
- Critical Function: Without proper propagation, spans appear as disconnected, orphaned operations, breaking the end-to-end view.
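The inject/extract step can be illustrated with the W3C Trace Context `traceparent` header, whose version `00` format is `version-traceid-parentid-flags`. The helper names below are hypothetical, the sketch handles only version `00`, and it assumes a plain dictionary with lowercase header names; real tracing libraries also handle `tracestate`, other versions, and case-insensitive header lookup.

```python
import re

# version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled):
    """Write the current trace context into outbound request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Read trace context from inbound headers; None means start a new trace."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# The calling service injects its context into the outbound request...
outbound = {}
inject(outbound, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True)

# ...and the receiving service extracts it to parent its own spans correctly.
context = extract(outbound)
print(outbound["traceparent"], context)
```

If `extract` returns `None`, the receiving service has no parent to attach to and would start a fresh trace, which is exactly how orphaned spans arise when propagation is missing.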
Service Graph
A service graph is a topological map automatically derived from trace data that shows the services in a system and the directional request flows (dependencies) between them. Jaeger can generate service graphs to provide a high-level architectural view. Key features:
- Dynamic Discovery: The graph is not manually configured; it is inferred by observing actual traffic patterns captured in traces.
- Operational Insights: Reveals unexpected dependencies, highlights services with high error rates or latency, and aids in impact analysis during incidents.
- Derived Metric: Often includes edge metrics like request counts per minute and error rates between services, turning trace data into system topology intelligence.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.