Zipkin

Zipkin is an open-source distributed tracing system that collects and visualizes timing data to diagnose latency problems in microservice architectures.
DISTRIBUTED TRACING SYSTEM

What is Zipkin?

Zipkin is an open-source distributed tracing system that collects timing data for requests as they propagate across services in a microservices architecture. It helps developers and SREs visualize the path and latency of requests, identifying bottlenecks and failures by instrumenting applications to report spans—timed operations representing work like an HTTP call or database query. Managed by the OpenZipkin community, it provides a backend for storage, a query API, and a web UI for analyzing trace data.

The system operates by having instrumented applications send trace data to a Zipkin backend via transports like HTTP or Apache Kafka. The backend stores this data so that queries can reconstruct the complete request flow. Zipkin popularized the B3 propagation header format for transmitting trace context, and it is often fed by OpenTelemetry Collector pipelines. Zipkin covers the tracing part of observability; metrics and logs are left to other tools in the ecosystem.

Key Features of Zipkin

Zipkin is an open-source distributed tracing system that collects timing data for requests as they propagate across service boundaries, enabling latency analysis and dependency mapping in microservice architectures.

01

Span-Based Data Model

Zipkin's fundamental data unit is the span, which represents a single, timed operation within a service (e.g., an HTTP call or database query). Spans contain:

  • Name: The operation name.
  • Timestamp & Duration: Precise timing data.
  • Tags: Key-value pairs for contextual metadata (e.g., http.method=GET).
  • Annotations: Timestamped event logs within a span's lifetime.

Spans in a trace share a Trace ID, and each child span records its parent's Span ID as its Parent ID, which lets Zipkin reconstruct the complete request path.
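
The span fields listed above can be sketched as a small Python data class. This is an illustrative model, not a Zipkin client library: the class name `Span`, the helper `new_id`, and the attribute names are assumptions made for this example, while the keys produced by `to_zipkin_v2` follow Zipkin's v2 JSON span format.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    """Return a random 64-bit ID as 16 lower-hex characters (hypothetical helper)."""
    return secrets.token_hex(8)


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    service_name: str                   # illustrative field; becomes localEndpoint
    parent_id: Optional[str] = None     # None marks the root span of a trace
    kind: Optional[str] = None          # e.g. "CLIENT" or "SERVER"
    timestamp_us: int = 0               # start time in epoch microseconds
    duration_us: int = 0                # elapsed time in microseconds
    tags: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)

    def annotate(self, value: str, timestamp_us: Optional[int] = None) -> None:
        """Record a timestamped event inside the span's lifetime."""
        if timestamp_us is None:
            timestamp_us = time.time_ns() // 1_000
        self.annotations.append({"timestamp": timestamp_us, "value": value})

    def to_zipkin_v2(self) -> dict:
        """Serialize using the field names of Zipkin's v2 JSON span format."""
        doc = {
            "traceId": self.trace_id,
            "id": self.span_id,
            "name": self.name,
            "timestamp": self.timestamp_us,
            "duration": self.duration_us,
            "localEndpoint": {"serviceName": self.service_name},
            "tags": dict(self.tags),
            "annotations": list(self.annotations),
        }
        if self.parent_id is not None:
            doc["parentId"] = self.parent_id
        if self.kind is not None:
            doc["kind"] = self.kind
        return doc


# A root span and one child that shares its trace ID.
root = Span(new_id(), new_id(), "get /checkout", "frontend", kind="SERVER",
            timestamp_us=1_700_000_000_000_000, duration_us=42_000,
            tags={"http.method": "GET"})
child = Span(root.trace_id, new_id(), "select orders", "frontend",
             parent_id=root.span_id, kind="CLIENT",
             timestamp_us=1_700_000_000_005_000, duration_us=12_000)
child.annotate("cache.miss", 1_700_000_000_006_000)
```

The child carries the root's trace ID plus the root's span ID as `parentId`, which is all a backend needs to place it under the root when the trace is rebuilt.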
02

Trace Context Propagation

To correlate work across services, Zipkin propagates trace context using standardized headers. It primarily supports:

  • B3 Propagation: The original header format (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled).
  • W3C Trace Context: The modern W3C standard for interoperability with other systems like OpenTelemetry.

This propagation is handled by instrumentation libraries or a propagator, ensuring the Trace ID is carried through HTTP, gRPC, and messaging systems to maintain a continuous trace.
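
As a rough illustration of B3 multi-header propagation, the sketch below extracts a trace context from incoming headers, creates a child context for an outbound call, and injects it into the outgoing headers. The names `TraceContext`, `inject_b3`, `extract_b3`, and `child_of` are hypothetical and written for this example; real instrumentation libraries ship their own propagators and also handle cases this sketch ignores, such as the single `b3` header and the debug flag.

```python
import secrets
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    sampled: Optional[bool] = None      # None means no sampling decision yet


def inject_b3(ctx: TraceContext) -> dict:
    """Write a trace context into B3 multi-header form."""
    headers = {"X-B3-TraceId": ctx.trace_id, "X-B3-SpanId": ctx.span_id}
    if ctx.parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = ctx.parent_span_id
    if ctx.sampled is not None:
        headers["X-B3-Sampled"] = "1" if ctx.sampled else "0"
    return headers


def extract_b3(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Read a trace context from headers; header names are case-insensitive."""
    lowered = {k.lower(): v for k, v in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if not trace_id or not span_id:
        return None                     # no usable context: caller starts a new trace
    sampled_raw = lowered.get("x-b3-sampled")
    sampled = None if sampled_raw is None else sampled_raw in ("1", "true")
    return TraceContext(trace_id, span_id, lowered.get("x-b3-parentspanid"), sampled)


def child_of(ctx: TraceContext) -> TraceContext:
    """Create the context for an outbound call made while handling ctx."""
    return TraceContext(ctx.trace_id, secrets.token_hex(8), ctx.span_id, ctx.sampled)


incoming = {
    "x-b3-traceid": "463ac35c9f6413ad48485a3953bb6124",
    "x-b3-spanid": "a2fb4a1d1a96d312",
    "x-b3-sampled": "1",
}
ctx = extract_b3(incoming)
outgoing = inject_b3(child_of(ctx))
```

The trace ID and sampling decision pass through unchanged, while the outbound call gets a fresh span ID whose parent is the span that received the request.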
03

Multi-Component Architecture

Zipkin is designed as a collection of loosely coupled components:

  • Instrumented Application: Services generate trace data (spans).
  • Reporters/Exporters: Send spans from the application to a Zipkin collector (often via HTTP or Kafka).
  • Collector: Validates, indexes, and persists spans to storage.
  • Storage Backend: Supports pluggable options including Elasticsearch, Cassandra, and MySQL.
  • Query Service & API: Retrieves traces and dependencies from storage.
  • Web UI: Provides a graphical interface for finding traces and visualizing each one as a timeline of spans (a waterfall view).
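
The reporter-to-collector step can be sketched with the Python standard library alone. The example assumes a Zipkin server reachable at `http://localhost:9411` (Zipkin's default port) and targets the v2 HTTP endpoint `/api/v2/spans`; the function names `build_report_request` and `report` are hypothetical, and a production reporter would also batch, retry, and send asynchronously.

```python
import json
import urllib.request


def build_report_request(spans, base_url="http://localhost:9411"):
    """Build a POST request carrying a JSON list of v2 spans for the collector."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v2/spans",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def report(spans, base_url="http://localhost:9411", timeout=2.0):
    """Send spans to the collector and return the HTTP status code.

    Requires a running Zipkin server; it is not called in this sketch.
    """
    request = build_report_request(spans, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


example_spans = [{
    "traceId": "463ac35c9f6413ad48485a3953bb6124",
    "id": "a2fb4a1d1a96d312",
    "name": "get /checkout",
    "kind": "SERVER",
    "timestamp": 1_700_000_000_000_000,   # epoch microseconds
    "duration": 42_000,                   # microseconds
    "localEndpoint": {"serviceName": "frontend"},
    "tags": {"http.method": "GET"},
}]
request = build_report_request(example_spans)
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a live collector.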
04

Dependency Analysis & Service Graphs

By analyzing trace data, Zipkin can automatically generate service dependency graphs. This visualization maps:

  • Nodes: Represent each service in the architecture.
  • Edges: Show the direction and volume of calls between services.

This feature is critical for understanding systemic topology, identifying unexpected dependencies, and visualizing the impact of a service failure. The graph is derived from span kind attributes (e.g., Client, Server).
05

Integration & Instrumentation Ecosystem

Zipkin offers broad support for different frameworks and languages through community-maintained libraries. Instrumentation can be achieved via:

  • Manual Instrumentation: Using the Zipkin client library API directly.
  • Framework Instrumentation: Pre-built tracing for Spring (Sleuth), JAX-RS, gRPC, etc.
  • OpenTelemetry Integration: Spans generated by OpenTelemetry (OTel) SDKs can be exported to Zipkin, typically through a Zipkin-format exporter in the SDK or in the OpenTelemetry Collector, making it a viable backend for OTel-based observability pipelines.
06

Sampling for Scalability

To manage data volume and storage costs in high-throughput systems, Zipkin supports trace sampling. This is typically configured at the instrumentation level.

  • Head-based Sampling: A decision is made at the start of a trace (e.g., sample 10% of requests).
  • Delegated Sampling: Can be integrated with external sampling proxies.

While Zipkin itself does not perform tail sampling (decision after trace completion), this can be implemented upstream using a component like the OpenTelemetry Collector before data is sent to Zipkin storage.
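
Head-based sampling can be illustrated with a decision that depends only on the trace ID, so every service that sees the same trace reaches the same verdict. The function `should_sample` and its modulo scheme are assumptions made for this sketch, not the sampler shipped by any particular Zipkin library; in practice the decision is usually made once at the trace root and propagated downstream in the trace context.

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Decide at the start of a trace whether to record it.

    trace_id is a 16- or 32-character hex string; rate is the fraction
    of traces to keep, from 0.0 (none) to 1.0 (all).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    low_64_bits = int(trace_id[-16:], 16)       # use the lower 64 bits of the ID
    threshold = round(rate * 10_000)            # resolution of 0.01%
    return (low_64_bits % 10_000) < threshold


# The decision is deterministic for a given trace ID and rate.
keep = should_sample("00000000000003e7", 0.10)   # 0x3e7 = 999, below threshold 1000
drop = should_sample("00000000000003e8", 0.10)   # 0x3e8 = 1000, not below 1000
```

Because trace IDs are random, roughly `rate` of all traces fall under the threshold, and no coordination between services is required.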
How Zipkin Works

Zipkin is an open-source distributed tracing system that collects and visualizes timing data for requests as they propagate across a microservices architecture.

Zipkin operates by instrumenting services to generate spans—timed records of operations like API calls or database queries. These spans, linked by a shared trace ID, are reported to a Zipkin backend. The system uses context propagation via headers (like the B3 format) to pass this ID between services, maintaining a continuous end-to-end trace of the entire request lifecycle for latency analysis.
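
The reconstruction step described above can be sketched as follows: given the flat list of spans reported for one trace, link each span to its parent and print the result as an indented tree. The function names `build_tree` and `render` are hypothetical, and the span dictionaries use Zipkin v2 JSON field names; spans whose parent was never reported are treated as roots here, which is a simplification.

```python
from collections import defaultdict


def build_tree(spans):
    """Group one trace's spans into root spans and a parent-id -> children map."""
    known_ids = {s["id"] for s in spans}
    children = defaultdict(list)
    roots = []
    # Visit spans in start-time order so siblings come out chronologically.
    for span in sorted(spans, key=lambda s: s.get("timestamp", 0)):
        parent_id = span.get("parentId")
        if parent_id in known_ids:
            children[parent_id].append(span)
        else:
            roots.append(span)          # true root, or orphan with a missing parent
    return roots, children


def render(spans):
    """Return one indented text line per span, parents before children."""
    roots, children = build_tree(spans)
    lines = []

    def walk(span, depth):
        service = span["localEndpoint"]["serviceName"]
        lines.append(f"{'  ' * depth}{service}: {span['name']} ({span.get('duration', 0)} us)")
        for child in children[span["id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


trace = [
    {"traceId": "t1", "id": "b", "parentId": "a", "name": "get /orders",
     "timestamp": 20, "duration": 30, "localEndpoint": {"serviceName": "orders"}},
    {"traceId": "t1", "id": "a", "name": "get /checkout",
     "timestamp": 10, "duration": 100, "localEndpoint": {"serviceName": "frontend"}},
    {"traceId": "t1", "id": "c", "parentId": "b", "name": "select",
     "timestamp": 25, "duration": 5, "localEndpoint": {"serviceName": "orders"}},
]
tree_lines = render(trace)
```

Spans can arrive in any order and from any service; the shared trace ID groups them and the parent IDs give the tree its shape.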

The collected trace data is stored and indexed, and the UI renders each trace as a timeline of spans so that performance bottlenecks stand out. Zipkin also generates dependency graphs showing service interactions. It integrates with instrumentation libraries and the OpenTelemetry Collector via protocols like JSON over HTTP. Zipkin is a focused tool for troubleshooting latency in distributed systems and has no built-in metrics or logging.

DISTRIBUTED TRACING SYSTEMS

Zipkin vs. Jaeger vs. OpenTelemetry

A technical comparison of three major open-source projects for distributed tracing, focusing on architecture, data collection, and ecosystem role.

| Feature / Component | Zipkin | Jaeger | OpenTelemetry |
| --- | --- | --- | --- |
| Primary Role | Distributed tracing backend and API | End-to-end distributed tracing system | Vendor-neutral telemetry framework (API/SDK/Collector) |
| Instrumentation Model | Manual or via community libraries; B3 propagation | Manual or via client libraries; supports multiple propagators | Standardized API/SDK for manual and auto-instrumentation; defines OTLP |
| Data Collection Protocol | HTTP/JSON, Scribe (Thrift), Kafka | UDP/Thrift, HTTP/JSON, gRPC, Kafka | OTLP (gRPC/HTTP), also supports Zipkin & Jaeger protocols |
| Native Trace Context Propagation | B3 Propagation (X-B3-* headers) | B3, W3C Trace Context, Jaeger baggage | W3C Trace Context (reference implementation); supports B3 & Jaeger |
| Core Architecture | Collector, Storage, Query Service, UI | Agent, Collector, Query Service, UI, Ingester | API/SDK (per language), Collector (receivers/processors/exporters) |
| Default Storage Backend | In-memory, Cassandra, Elasticsearch, MySQL | In-memory, Cassandra, Elasticsearch, Kafka+ES | None (telemetry exporter); Collector supports many backends |
| Sampling Strategy | Delegated to client/tracer; collector can sample | Client-side (probabilistic, rate-limiting, remote), Tail sampling via collector | Head sampling in SDK; Tail sampling in Collector |
| Vendor Lock-in Risk | Low (focused on tracing, simple API) | Low (open-source, can export data) | Very Low (industry standard, decouples instrumentation from backend) |
| Integration with APM/Backends | Many backends support Zipkin format ingestion | Direct Jaeger backend or via compatible formats | Primary integration path for modern APM tools (via OTLP) |
| Deployment Model | Single binary or separate components | All-in-one binary or scalable microservice deployment | Library/SDK in app, Collector as sidecar/daemon/central service |

ZIPKIN

Frequently Asked Questions

Zipkin is an open-source distributed tracing system essential for monitoring latency and dependencies in microservice and agentic architectures. These FAQs address its core mechanisms, integration, and role in modern observability.

Zipkin is an open-source distributed tracing system that collects and visualizes timing data for requests as they propagate through a distributed system. It works by instrumenting services to generate spans, which are timed records of individual operations. Spans are correlated by a unique trace ID to reconstruct the complete end-to-end path of a request. They are sent to a Zipkin backend, where they are stored, indexed, and presented as trace timelines or dependency graphs to help engineers identify latency bottlenecks and service dependencies.

Key components include:

  • Instrumented Applications: Services generate trace data.
  • Zipkin Collector: Receives span data via HTTP or messaging queues.
  • Storage Backend: Supports databases like Elasticsearch or Cassandra.
  • Zipkin UI: A web interface for querying and visualizing traces.
About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.