Zipkin is an open-source distributed tracing system that collects timing data for requests as they propagate across services in a microservices architecture. It helps developers and SREs visualize the path and latency of requests, identifying bottlenecks and failures by instrumenting applications to report spans—timed operations representing work like an HTTP call or database query. Managed by the OpenZipkin community, it provides a backend for storage, a query API, and a web UI for analyzing trace data.
What is Zipkin?
Zipkin is an open-source distributed tracing system for collecting and visualizing timing data to troubleshoot latency problems in service-oriented architectures.
The system operates by having instrumented applications send trace data to a Zipkin backend via transports like HTTP or Apache Kafka. It stores this data, allowing queries to reconstruct the complete request flow. Zipkin popularized the B3 propagation header format for transmitting trace context and is often integrated via OpenTelemetry collectors. While a foundational tool, it is part of the broader observability ecosystem for understanding system behavior through end-to-end tracing.
Key Features of Zipkin
Zipkin is an open-source distributed tracing system that collects timing data for requests as they propagate across service boundaries, enabling latency analysis and dependency mapping in microservice architectures.
Span-Based Data Model
Zipkin's fundamental data unit is the span, which represents a single, timed operation within a service (e.g., an HTTP call or database query). Spans contain:
- Name: The operation name.
- Timestamp & Duration: Precise timing data.
- Tags: Key-value pairs for contextual metadata (e.g., http.method=GET).
- Annotations: Timestamped event logs within a span's lifetime.
Spans are linked via parent-child relationships using Trace ID and Span ID to reconstruct the complete request path.
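As a rough illustration of the fields listed above, the following sketch builds two linked spans as Python dicts shaped like Zipkin's v2 JSON span format. The helper names (`new_id`, `make_span`) and the service and operation names are hypothetical and exist only for this example.

```python
import json
import secrets
import time


def new_id(num_bytes=8):
    """Return a random lower-hex identifier (8 bytes gives 16 hex characters)."""
    return secrets.token_hex(num_bytes)


def make_span(name, service, trace_id=None, parent_id=None, kind="SERVER",
              start_us=None, duration_us=0, tags=None, annotations=None):
    """Build a dict shaped like a Zipkin v2 JSON span (illustrative helper)."""
    span = {
        "traceId": trace_id or new_id(16),  # 128-bit trace ID shared by the whole trace
        "id": new_id(8),                    # 64-bit ID of this span
        "name": name,
        "kind": kind,
        # Zipkin timestamps and durations are expressed in microseconds.
        "timestamp": start_us if start_us is not None else int(time.time() * 1_000_000),
        "duration": duration_us,
        "localEndpoint": {"serviceName": service},
        "tags": dict(tags or {}),            # string keys and string values
        "annotations": list(annotations or []),
    }
    if parent_id is not None:
        span["parentId"] = parent_id        # links this span to the span that caused it
    return span


# A root span and one child span in the same trace (hypothetical names).
root = make_span("get /checkout", "frontend", duration_us=45_000,
                 tags={"http.method": "GET"})
child = make_span("select orders", "orders", trace_id=root["traceId"],
                  parent_id=root["id"], kind="CLIENT",
                  start_us=root["timestamp"] + 5_000, duration_us=12_000,
                  annotations=[{"timestamp": root["timestamp"] + 6_000,
                                "value": "cache.miss"}])

print(json.dumps([root, child], indent=2))
```

The child carries the root's trace ID and points at the root's span ID through `parentId`, which is what lets a backend rebuild the request path.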
Trace Context Propagation
To correlate work across services, Zipkin propagates trace context using standardized headers. It primarily supports:
- B3 Propagation: The original header format (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId).
- W3C Trace Context: The modern W3C standard for interoperability with other systems like OpenTelemetry.
This propagation is handled by instrumentation libraries or a propagator, ensuring the Trace ID is carried through HTTP, gRPC, and messaging systems to maintain a continuous trace.
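To make the two header formats concrete, here is a minimal sketch that converts between B3 multi-header form and a W3C `traceparent` value. The function names are hypothetical, and the conversion is simplified: it ignores `tracestate`, the B3 debug flag, and the single-header `b3` variant.

```python
def b3_to_traceparent(headers):
    """Convert B3 multi-headers to a W3C traceparent string (simplified)."""
    trace_id = headers["X-B3-TraceId"].lower()
    span_id = headers["X-B3-SpanId"].lower()
    # B3 permits 64-bit trace IDs; traceparent requires 128 bits, so left-pad.
    trace_id = trace_id.rjust(32, "0")
    flags = "01" if headers.get("X-B3-Sampled") == "1" else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def traceparent_to_b3(traceparent):
    """Convert a W3C traceparent string to B3 multi-headers (simplified)."""
    version, trace_id, span_id, flags = traceparent.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    sampled = int(flags, 16) & 0x01  # lowest bit is the sampled flag
    return {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1" if sampled else "0",
    }


b3 = {"X-B3-TraceId": "463ac35c9f6413ad",
      "X-B3-SpanId": "a2fb4a1d1a96d312",
      "X-B3-Sampled": "1"}
print(b3_to_traceparent(b3))
print(traceparent_to_b3("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

In practice this translation is normally left to a propagator in the instrumentation library rather than written by hand.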
Multi-Component Architecture
Zipkin is designed as a collection of loosely coupled components:
- Instrumented Application: Services generate trace data (spans).
- Reporters/Exporters: Send spans from the application to a Zipkin collector (often via HTTP or Kafka).
- Collector: Validates, indexes, and persists spans to storage.
- Storage Backend: Supports pluggable options including Elasticsearch, Cassandra, and MySQL.
- Query Service & API: Retrieves traces and dependencies from storage.
- Web UI: Provides a graphical interface for finding and visualizing traces as flame graphs.
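The reporter-to-collector hop in the list above can be sketched with the Python standard library alone. This example assumes a Zipkin server on its default port at `http://localhost:9411` and uses the v2 HTTP endpoint `/api/v2/spans`; the function names and the sample span are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumed address of a locally running Zipkin server.
ZIPKIN_SPANS_URL = "http://localhost:9411/api/v2/spans"


def build_report_request(spans, url=ZIPKIN_SPANS_URL):
    """Build (but do not send) an HTTP POST carrying a JSON list of spans."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def report(spans, url=ZIPKIN_SPANS_URL, timeout=2.0):
    """Send the spans to the collector and return the HTTP status code."""
    request = build_report_request(spans, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


example_spans = [{
    "traceId": "463ac35c9f6413ad48485a3953bb6124",
    "id": "a2fb4a1d1a96d312",
    "name": "get /checkout",
    "timestamp": 1_700_000_000_000_000,
    "duration": 45_000,
    "localEndpoint": {"serviceName": "frontend"},
}]
request = build_report_request(example_spans)
print(request.get_method(), request.full_url)
```

Production reporters usually batch spans and send them asynchronously so that tracing does not add latency to the request being traced; this sketch omits both.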
Dependency Analysis & Service Graphs
By analyzing trace data, Zipkin can automatically generate service dependency graphs. This visualization maps:
- Nodes: Represent each service in the architecture.
- Edges: Show the direction and volume of calls between services.
This feature is critical for understanding systemic topology, identifying unexpected dependencies, and visualizing the impact of a service failure. The graph is derived from span kind attributes (e.g., Client, Server).
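A simplified way to picture how edges come out of trace data: for every span whose parent belongs to a different service, count one call from the parent's service to the child's. The sketch below does only that. It is not Zipkin's own linking logic, which also takes span kind and remote endpoints into account, and the function name and sample spans are hypothetical.

```python
from collections import Counter


def dependency_links(spans):
    """Count caller -> callee edges from parent/child spans in different services."""
    by_id = {(s["traceId"], s["id"]): s for s in spans}
    links = Counter()
    for span in spans:
        parent = by_id.get((span["traceId"], span.get("parentId")))
        if parent is None:
            continue  # root span, or the parent was never reported
        caller = parent["localEndpoint"]["serviceName"]
        callee = span["localEndpoint"]["serviceName"]
        if caller != callee:
            links[(caller, callee)] += 1
    return links


def _span(trace_id, span_id, service, parent_id=None):
    span = {"traceId": trace_id, "id": span_id,
            "localEndpoint": {"serviceName": service}}
    if parent_id:
        span["parentId"] = parent_id
    return span


sample = [
    _span("t1", "a", "frontend"),
    _span("t1", "b", "orders", parent_id="a"),
    _span("t1", "c", "orders", parent_id="b"),   # same service: not an edge
    _span("t1", "d", "payments", parent_id="b"),
    _span("t2", "a", "frontend"),
    _span("t2", "b", "orders", parent_id="a"),
]
for (caller, callee), calls in sorted(dependency_links(sample).items()):
    print(f"{caller} -> {callee}: {calls} call(s)")
```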
Integration & Instrumentation Ecosystem
Zipkin offers broad support for different frameworks and languages through community-maintained libraries. Instrumentation can be achieved via:
- Manual Instrumentation: Using the Zipkin client library API directly.
- Framework Instrumentation: Pre-built tracing for Spring (Sleuth), JAX-RS, gRPC, etc.
- OpenTelemetry Integration: Spans generated by OpenTelemetry (OTel) SDKs can be exported to Zipkin with a Zipkin-format exporter, either directly from the SDK or from an OpenTelemetry Collector that receives OTLP, making Zipkin a viable backend for OTel-based observability pipelines.
Sampling for Scalability
To manage data volume and storage costs in high-throughput systems, Zipkin supports trace sampling. This is typically configured at the instrumentation level.
- Head-based Sampling: A decision is made at the start of a trace (e.g., sample 10% of requests).
- Delegated Sampling: Can be integrated with external sampling proxies.
While Zipkin itself does not perform tail sampling (decision after trace completion), this can be implemented upstream using a component like the OpenTelemetry Collector before data is sent to Zipkin storage.
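Head-based sampling can be sketched as a pure function of the trace ID, so that every service reaches the same keep-or-drop decision for a given trace without coordination. The function below is an illustrative assumption, not the algorithm used by any particular Zipkin tracer; in real deployments the decision is typically made once at the trace root and then propagated (for example via `X-B3-Sampled`).

```python
def should_sample(trace_id_hex, rate):
    """Decide at the start of a trace whether to record it.

    The decision depends only on the trace ID, so it is repeatable for the
    same trace. `rate` is the fraction of traces to keep, from 0.0 to 1.0.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    low_64_bits = int(trace_id_hex[-16:], 16)  # use the lower 64 bits of the ID
    # Bucket the ID into 10,000 slots and keep the first rate * 10,000 of them.
    return low_64_bits % 10_000 < round(rate * 10_000)


print(should_sample("463ac35c9f6413ad48485a3953bb6124", 0.10))
```

Because trace IDs are random, roughly `rate` of all traces fall into the kept buckets over a large number of requests.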
How Zipkin Works
Zipkin is an open-source distributed tracing system that collects and visualizes timing data for requests as they propagate across a microservices architecture.
Zipkin operates by instrumenting services to generate spans—timed records of operations like API calls or database queries. These spans, linked by a shared trace ID, are reported to a Zipkin backend. The system uses context propagation via headers (like the B3 format) to pass this ID between services, maintaining a continuous end-to-end trace of the entire request lifecycle for latency analysis.
The collected trace data is stored and indexed, enabling visualization through a flame graph to identify performance bottlenecks. Zipkin also generates dependency graphs showing service interactions. It integrates with instrumentation libraries and the OpenTelemetry Collector via protocols like JSON over HTTP, providing a focused tool for troubleshooting distributed system latency without built-in metrics or logging.
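The reconstruction step described above can be sketched as follows: given the spans of one trace, group them by parent ID and walk the resulting tree, indenting each level. This is a text-only stand-in for what a trace view shows; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def render_trace(spans):
    """Return indented lines showing the parent/child structure of one trace."""
    known_ids = {s["id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent_id = span.get("parentId")
        if parent_id in known_ids:
            children[parent_id].append(span)
        else:
            roots.append(span)  # no parent, or parent missing from the data

    lines = []

    def walk(span, depth):
        service = span["localEndpoint"]["serviceName"]
        millis = span["duration"] / 1000  # durations are in microseconds
        lines.append(f"{'  ' * depth}{service}: {span['name']} ({millis:.1f} ms)")
        for child in sorted(children[span["id"]], key=lambda s: s["timestamp"]):
            walk(child, depth + 1)

    for root in sorted(roots, key=lambda s: s["timestamp"]):
        walk(root, 0)
    return lines


trace = [
    {"id": "c", "parentId": "b", "name": "select", "timestamp": 20, "duration": 12_000,
     "localEndpoint": {"serviceName": "orders"}},
    {"id": "a", "name": "get /checkout", "timestamp": 0, "duration": 45_000,
     "localEndpoint": {"serviceName": "frontend"}},
    {"id": "b", "parentId": "a", "name": "get /orders", "timestamp": 10, "duration": 30_000,
     "localEndpoint": {"serviceName": "orders"}},
]
print("\n".join(render_trace(trace)))
```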
Zipkin vs. Jaeger vs. OpenTelemetry
A technical comparison of three major open-source projects for distributed tracing, focusing on architecture, data collection, and ecosystem role.
| Feature / Component | Zipkin | Jaeger | OpenTelemetry |
|---|---|---|---|
| Primary Role | Distributed tracing backend and API | End-to-end distributed tracing system | Vendor-neutral telemetry framework (API/SDK/Collector) |
| Instrumentation Model | Manual or via community libraries; B3 propagation | Manual or via client libraries; supports multiple propagators | Standardized API/SDK for manual and auto-instrumentation; defines OTLP |
| Data Collection Protocol | HTTP/JSON, Scribe (Thrift), Kafka | UDP/Thrift, HTTP/JSON, gRPC, Kafka | OTLP (gRPC/HTTP), also supports Zipkin & Jaeger protocols |
| Native Trace Context Propagation | B3 Propagation (X-B3-* headers) | B3, W3C Trace Context, Jaeger baggage | W3C Trace Context (reference implementation); supports B3 & Jaeger |
| Core Architecture | Collector, Storage, Query Service, UI | Agent, Collector, Query Service, UI, Ingester | API/SDK (per language), Collector (receivers/processors/exporters) |
| Storage Backends | In-memory, Cassandra, Elasticsearch, MySQL | In-memory, Cassandra, Elasticsearch, Kafka+ES | None (telemetry exporter); Collector supports many backends |
| Sampling Strategy | Delegated to client/tracer; collector can sample | Client-side (probabilistic, rate-limiting, remote), Tail sampling via collector | Head sampling in SDK; Tail sampling in Collector |
| Vendor Lock-in Risk | Low (focused on tracing, simple API) | Low (open-source, can export data) | Very Low (industry standard, decouples instrumentation from backend) |
| Integration with APM/Backends | Many backends support Zipkin format ingestion | Direct Jaeger backend or via compatible formats | Primary integration path for modern APM tools (via OTLP) |
| Deployment Model | Single binary or separate components | All-in-one binary or scalable microservice deployment | Library/SDK in app, Collector as sidecar/daemon/central service |
Frequently Asked Questions
Zipkin is an open-source distributed tracing system essential for monitoring latency and dependencies in microservice and agentic architectures. These FAQs address its core mechanisms, integration, and role in modern observability.
Zipkin is an open-source distributed tracing system that collects and visualizes timing data for requests as they propagate through a distributed system. It works by instrumenting services to generate spans—timed records of individual operations—which are correlated using a unique trace ID to reconstruct the complete end-to-end path of a request. Spans are sent to a Zipkin backend, where they are stored, analyzed, and presented as flame graphs or dependency graphs to help engineers identify latency bottlenecks and service dependencies.
Key components include:
- Instrumented Applications: Services generate trace data.
- Zipkin Collector: Receives span data via HTTP or messaging queues.
- Storage Backend: Supports databases like Elasticsearch or Cassandra.
- Zipkin UI: A web interface for querying and visualizing traces.
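The UI and other clients read from the backend through Zipkin's HTTP query API. As a small sketch, the helpers below build URLs for two v2 endpoints, `/api/v2/traces` (search) and `/api/v2/trace/{traceId}` (fetch one trace). The base URL assumes a local server on the default port, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def traces_url(base, service_name, limit=10, lookback_ms=3_600_000,
               min_duration_us=None):
    """Build a search URL for recent traces of one service."""
    params = {
        "serviceName": service_name,
        "limit": limit,
        "lookback": lookback_ms,       # how far back to search, in milliseconds
    }
    if min_duration_us is not None:
        params["minDuration"] = min_duration_us  # only traces at least this slow
    return f"{base.rstrip('/')}/api/v2/traces?{urlencode(params)}"


def trace_url(base, trace_id):
    """Build the URL that returns every span of a single trace."""
    return f"{base.rstrip('/')}/api/v2/trace/{trace_id}"


BASE = "http://localhost:9411"  # assumed local Zipkin server
print(traces_url(BASE, "frontend", limit=5, min_duration_us=100_000))
print(trace_url(BASE, "463ac35c9f6413ad48485a3953bb6124"))
```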
Related Terms
Zipkin operates within a broader ecosystem of distributed tracing concepts and tools. Understanding these related terms is essential for implementing effective observability.
Span & Trace
The span and the trace are the two fundamental data models in distributed tracing that Zipkin collects and visualizes.
- Span: Represents a single, named, timed operation within a service (e.g., a database query, an HTTP handler). A span contains:
  - Span ID: A unique identifier for this operation.
  - Parent ID: A reference to the span that caused this work (creating a hierarchy).
  - Timestamps: Start and duration.
  - Tags/Annotations: Key-value metadata (e.g., http.method=GET).
- Trace: A directed acyclic graph (DAG) of spans that represents the end-to-end journey of a request. All spans in a trace share a single, globally unique Trace ID. Zipkin's UI reconstructs these hierarchies to show the flame graph visualization of a request's path.
Context Propagation
Context propagation is the critical mechanism that enables distributed tracing by passing trace identifiers (Trace ID, Span ID) across service boundaries. Without it, spans cannot be linked into a coherent trace.
- Propagators: Code components that inject context into outbound requests (e.g., HTTP headers) and extract it from inbound requests.
- Formats: Zipkin originally popularized the B3 Propagation format, which uses headers like X-B3-TraceId. The W3C Trace Context standard is now widely adopted for interoperability.
- Process: When Service A calls Service B, Service A's tracing SDK injects the current trace and span IDs into the HTTP headers. Service B's SDK extracts them, creating a new child span linked to the parent.
This propagation is what allows Zipkin to track requests across network hops, queues, and asynchronous processes.
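The Service A to Service B handoff described above can be sketched with plain dicts standing in for HTTP headers. The `inject`, `extract`, and `start_child` functions are hypothetical stand-ins for what a tracing SDK's propagator does; only the B3 header names come from the format itself.

```python
import secrets


def inject(context, headers):
    """Service A: copy the current trace context into outbound B3 headers."""
    headers["X-B3-TraceId"] = context["trace_id"]
    headers["X-B3-SpanId"] = context["span_id"]
    if context.get("parent_id"):
        headers["X-B3-ParentSpanId"] = context["parent_id"]
    headers["X-B3-Sampled"] = "1" if context["sampled"] else "0"
    return headers


def extract(headers):
    """Service B: read trace context from inbound headers, or None if absent."""
    lowered = {key.lower(): value for key, value in headers.items()}
    if "x-b3-traceid" not in lowered or "x-b3-spanid" not in lowered:
        return None  # no incoming context: the receiver would start a new trace
    return {
        "trace_id": lowered["x-b3-traceid"],
        "span_id": lowered["x-b3-spanid"],
        "parent_id": lowered.get("x-b3-parentspanid"),
        "sampled": lowered.get("x-b3-sampled") == "1",
    }


def start_child(parent_context):
    """Create a child context: same trace ID, new span ID, parent set to the caller."""
    return {
        "trace_id": parent_context["trace_id"],
        "span_id": secrets.token_hex(8),
        "parent_id": parent_context["span_id"],
        "sampled": parent_context["sampled"],
    }


# Service A makes an outbound call.
client_context = {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8),
                  "parent_id": None, "sampled": True}
outbound_headers = inject(client_context, {})

# Service B receives the call and continues the trace.
server_span = start_child(extract(outbound_headers))
print(server_span)
```

The same inject and extract pattern applies to message queues, with message headers or attributes taking the place of HTTP headers.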
APM (Application Performance Monitoring)
Application Performance Monitoring (APM) is the overarching practice of monitoring software performance and availability. Distributed tracing, as implemented by Zipkin, is a core pillar of a modern APM strategy.
- APM Pillars: Typically includes Distributed Tracing, Metrics (time-series data), and Logs (event records).
- Zipkin's Role: Zipkin is a specialized tracing backend. Commercial and open-source APM Suites (e.g., Datadog APM, New Relic, Grafana Tempo) often bundle tracing with metrics, logging, and alerting into a unified platform.
- Use Case: While Zipkin excels at deep-dive latency analysis for microservices, a full APM solution provides a broader view, correlating traces with system metrics (CPU, error rates) and business KPIs.
Service Dependency Graph
A Service Dependency Graph (or Service Graph) is a topological map automatically generated from trace data by systems like Zipkin. It visualizes the runtime dependencies between services in a microservice architecture.
- Generation: Zipkin analyzes trace data to infer which services call which other services and the volume/latency of those calls.
- Purpose: Provides an immediate, visual understanding of system architecture and critical data flows. It is essential for:
  - Impact Analysis: Understanding which services will be affected by an outage.
  - Architecture Drift: Identifying unintended or new dependencies.
  - Bottleneck Identification: Spotting services with high fan-out or latency.
This graph moves observability from analyzing single requests (traces) to understanding the holistic system behavior.
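Impact analysis on such a graph amounts to walking the edges backwards: starting from a failed service, collect every service that calls it directly or transitively. The sketch below assumes the graph is available as a list of (caller, callee) pairs; the function name and the sample topology are hypothetical.

```python
from collections import defaultdict, deque


def affected_by_outage(links, failed_service):
    """Return the services that depend, directly or transitively, on failed_service."""
    callers_of = defaultdict(set)
    for caller, callee in links:
        callers_of[callee].add(caller)

    affected = set()
    queue = deque([failed_service])
    while queue:  # breadth-first walk up the call graph
        service = queue.popleft()
        for caller in callers_of[service]:
            if caller != failed_service and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


service_links = [
    ("frontend", "orders"),
    ("frontend", "auth"),
    ("orders", "db"),
    ("batch", "db"),
]
print(sorted(affected_by_outage(service_links, "db")))
```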

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.