Inferensys

Glossary

Distributed Tracing

Distributed tracing is a method for observing requests as they flow through a distributed system, like an LLM application stack, by recording timing and metadata for operations across service boundaries.
LLM PERFORMANCE MONITORING

What is Distributed Tracing?

Distributed tracing is a core observability method for profiling requests as they flow through a distributed system of microservices, such as an LLM application stack.

It works by recording timing and metadata for individual operations (spans) across service boundaries. The result is a unified, end-to-end view of a transaction's journey that correlates work done by disparate components, such as model inference engines, vector databases, and API gateways, into a single trace. This view is essential for diagnosing performance bottlenecks and understanding complex service dependencies in production.

In LLM operations, a trace visualizes the entire lifecycle of a user prompt, capturing spans for the initial API call, retrieval-augmented generation (RAG) lookups, model inference (including time to first token and inter-token latency), and any downstream tool calls. By instrumenting code with standards like OpenTelemetry, engineers can pinpoint the root cause of high latency or errors, whether in a specific microservice, a slow database query, or the LLM provider's API. This data feeds into Service Level Objective (SLO) monitoring and is crucial for systematic root cause analysis.

DISTRIBUTED TRACING

Core Components of a Trace

A distributed trace is a directed acyclic graph of causally related operations (spans) that record the execution path of a request through a system. These are the fundamental data structures that comprise a trace.

01

Span

A span is the fundamental unit of work in a trace, representing a single, named, and timed operation within a distributed transaction. It contains:

  • Operation Name: A human-readable identifier (e.g., llm.generate, vector_db.search).
  • Start & End Timestamps: Precise timing for latency calculation.
  • Span Context: A unique trace ID and span ID that propagate causality.
  • Attributes/Key-Value Pairs: Structured metadata (e.g., model="gpt-4", input_tokens=150).
  • Events: Log-like records with timestamps within the span's lifetime.
  • Status: Success, error, or unset, often with an error message.

In an LLM request, separate spans would typically represent the API gateway, prompt preprocessing, the LLM inference call, and any retrieval steps.
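
The fields listed above can be sketched as a small data structure. This is a minimal illustration, not the API of any tracing library: the Span class, its field names, and the example operation names are assumptions chosen to mirror the list.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """Illustrative span record; field names are hypothetical, not a library API."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    start_ns: int = field(default_factory=time.monotonic_ns)
    end_ns: Optional[int] = None
    attributes: dict[str, Any] = field(default_factory=dict)
    events: list[tuple[int, str]] = field(default_factory=list)
    status: str = "unset"  # "unset", "ok", or "error"

    def add_event(self, name: str) -> None:
        # Events are timestamped records inside the span's lifetime.
        self.events.append((time.monotonic_ns(), name))

    def end(self, status: str = "ok") -> None:
        self.end_ns = time.monotonic_ns()
        self.status = status

    @property
    def duration_ms(self) -> float:
        if self.end_ns is None:
            raise ValueError("span has not ended")
        return (self.end_ns - self.start_ns) / 1e6


# Every span in one request shares the same 128-bit trace ID.
trace_id = secrets.token_hex(16)
root = Span("api.request", trace_id)
llm = Span("llm.generate", trace_id, parent_id=root.span_id,
           attributes={"model": "gpt-4", "input_tokens": 150})
llm.add_event("first_token")
llm.end()
root.end()
```

The parent_id field is what turns a flat list of spans into a tree: the llm span points at the root span, and both carry the same trace ID.
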
02

Trace

A trace is a complete record of the journey of a single request (e.g., a user query to an LLM app). It is visualized as a tree or directed acyclic graph of spans, where the relationships between spans define the workflow. The root span represents the initial request entry point. For an LLM application performing Retrieval-Augmented Generation (RAG), a trace would show the parent API call spawning child spans for query understanding, vector database retrieval, and the final LLM generation, providing a holistic view of latency and dependencies.
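
As a rough sketch of how a backend might turn a flat list of spans into the tree described above, the example below groups spans by parent ID and renders them as an indented outline. The span tuples, IDs, and timings are invented for illustration.

```python
from collections import defaultdict

# Hypothetical finished spans: (span_id, parent_id, name, start_ms, end_ms)
spans = [
    ("a", None, "api.request", 0, 1200),
    ("b", "a", "query.rewrite", 10, 60),
    ("c", "a", "vector_db.search", 60, 260),
    ("d", "a", "llm.generate", 270, 1190),
]

children = defaultdict(list)
by_id = {}
for span in spans:
    by_id[span[0]] = span
    children[span[1]].append(span[0])


def render(span_id, depth=0):
    """Return indented lines for a span and its descendants, in start order."""
    _, _, name, start, end = by_id[span_id]
    lines = [f"{'  ' * depth}{name} ({end - start} ms)"]
    for child in sorted(children[span_id], key=lambda c: by_id[c][3]):
        lines.extend(render(child, depth + 1))
    return lines


root_id = children[None][0]  # the root span is the one without a parent
tree = render(root_id)
print("\n".join(tree))

# The slowest direct child of the root is a first hint at where latency goes.
slowest = max(children[root_id], key=lambda c: by_id[c][4] - by_id[c][3])
print("slowest child:", by_id[slowest][2])
```

In this made-up trace the generation step dominates the request, which is the kind of conclusion a waterfall view makes visible at a glance.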

03

Span Context & Propagation

Span Context is the immutable, portable object containing the minimal data needed to identify a span and its position in a trace: a trace ID, a span ID, trace flags, and trace state. Baggage (user-defined key-value pairs) travels alongside the span context but, in OpenTelemetry, is a separate mechanism. Propagation is the mechanism by which this context is transmitted across process and network boundaries, typically via HTTP headers (e.g., traceparent from W3C Trace Context) or gRPC metadata. This is critical in LLM microservices architectures: it allows a trace initiated at a frontend service to continue through middleware, model endpoints, and external API calls, maintaining a unified view.
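
To make propagation concrete, the sketch below writes and reads a W3C Trace Context traceparent header, which has the form version-traceid-parentid-flags (2, 32, 16, and 2 lowercase hex characters). The inject and extract helper names are hypothetical, and only version 00 is handled; production code would normally rely on a tracing library's propagator instead.

```python
import re

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the caller's span context into outbound request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read span context from inbound headers; return None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


outbound = {}
inject(outbound, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
incoming = extract(outbound)  # what the downstream service would recover
```

The downstream service starts its own span with a fresh span ID, reuses the extracted trace ID, and records the extracted parent ID as its parent.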

04

Attributes (Tags)

Attributes (also called tags) are key-value pairs attached to spans, trace data, or logs that provide descriptive, queryable metadata about the operation. They are essential for filtering, grouping, and analyzing telemetry data. Key attributes for LLM monitoring include:

  • Semantic Conventions: Standardized names (e.g., gen_ai.request.model, gen_ai.system).
  • Business Context: user_id, session_id, prompt_template_version.
  • Performance Data: input_token_count, output_token_count, cache_hit.
  • Quality Indicators: contains_hallucination=true, evaluation_score=0.85.

Proper attribute instrumentation turns raw timing data into actionable business and operational intelligence.
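
A short sketch of why attributes matter for analysis: once spans carry keys such as prompt_template_version or cache_hit, latency can be grouped by them. The span dictionaries, their values, and the mean_duration_by helper are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical finished spans, reduced to the fields needed here.
spans = [
    {"name": "llm.generate", "duration_ms": 900.0,
     "attributes": {"prompt_template_version": "v1", "cache_hit": False}},
    {"name": "llm.generate", "duration_ms": 1100.0,
     "attributes": {"prompt_template_version": "v1", "cache_hit": False}},
    {"name": "llm.generate", "duration_ms": 40.0,
     "attributes": {"prompt_template_version": "v2", "cache_hit": True}},
    {"name": "vector_db.search", "duration_ms": 120.0, "attributes": {}},
]


def mean_duration_by(spans, name, key):
    """Average duration of spans called `name`, grouped by attribute `key`."""
    groups = defaultdict(list)
    for span in spans:
        if span["name"] == name and key in span["attributes"]:
            groups[span["attributes"][key]].append(span["duration_ms"])
    return {value: mean(durations) for value, durations in groups.items()}


by_version = mean_duration_by(spans, "llm.generate", "prompt_template_version")
by_cache = mean_duration_by(spans, "llm.generate", "cache_hit")
```

Without the attributes, the same spans would only show that some generation calls are fast and some are slow, with no way to say which template or cache state explains the difference.
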
05

Events (Logs)

Events are structured log records with a timestamp, name, and optional attributes that are embedded within a span. They capture discrete moments during the span's execution, providing a detailed narrative. In LLM tracing, critical events include:

  • prompt.completed: With attributes for the final prompt text sent.
  • retrieval.started: Signaling a call to a vector database.
  • token.streamed: For tracking the progression of streaming outputs.
  • guardrail.triggered: Indicating a safety or moderation filter was activated.
  • exception: Capturing stack traces and error details.

Events provide the high-resolution "why" behind span durations and statuses.
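
As an illustration of what events make possible, the sketch below derives time to first token and mean inter-token latency from token.streamed events on a single generation span. The timestamps are invented, and streaming_latency is a hypothetical helper rather than part of any tracing SDK.

```python
def streaming_latency(span_start_ms, events):
    """Return (time_to_first_token_ms, mean_inter_token_ms) from span events.

    `events` is a list of (timestamp_ms, name) pairs. Either value is None
    when there are not enough token events to compute it.
    """
    token_times = [t for t, name in events if name == "token.streamed"]
    if not token_times:
        return None, None
    ttft = token_times[0] - span_start_ms
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return ttft, mean_gap


# Hypothetical events recorded on one llm.generate span that started at t=0 ms.
events = [
    (5.0, "prompt.completed"),
    (240.0, "token.streamed"),
    (265.0, "token.streamed"),
    (295.0, "token.streamed"),
    (320.0, "token.streamed"),
]
ttft_ms, mean_itl_ms = streaming_latency(0.0, events)
```

Recording one event per token can be expensive for long outputs, so a real system might record only the first token event plus periodic samples; that trade-off is a design choice, not something the span model dictates.
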
06

Links

A link associates a span with a causally related span context, either in the same trace or in a different one; a span can carry zero or more links. Unlike parent-child relationships, links represent a causal connection to a span outside the span's direct parent hierarchy. This is crucial for modeling batch or asynchronous processing in LLM systems. For example:

  • A background job that processes a batch of 100 user queries could create a single span linked to the 100 individual user request traces.
  • A span representing the training of a fine-tuned model could be linked to the traces of the inference requests whose feedback data triggered the training job.

Links enable navigation between related but independently initiated workflows.
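
The batch example above can be sketched with two small record types. SpanContext and BatchSpan here are illustrative stand-ins, not library classes, and three requests are used in place of 100 to keep the example short.

```python
import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    """Minimal identity of a span: which trace it is in and which span it is."""
    trace_id: str
    span_id: str


@dataclass
class BatchSpan:
    name: str
    context: SpanContext
    links: list = field(default_factory=list)  # span contexts from other traces


def new_context() -> SpanContext:
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


# Three independent user requests, each with its own trace.
request_contexts = [new_context() for _ in range(3)]

# The batch job runs in its own trace and links back to every request it serves.
batch = BatchSpan("batch.process_queries", new_context(), links=list(request_contexts))

linked_traces = {link.trace_id for link in batch.links}
```

The batch span has no parent among the request spans, yet a tracing UI can still jump from it to each originating request through the links.
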
LLM PERFORMANCE MONITORING

Distributed Tracing in LLM Applications

In an LLM application, a single user request—like a complex query to a Retrieval-Augmented Generation (RAG) pipeline—triggers a cascade of operations across multiple services. Distributed tracing instruments these services to create a trace, a complete record of the request's journey. Each operation, such as a vector database lookup or the LLM's autoregressive decoding, is recorded as a span containing timing data, metadata, and causal links. This end-to-end visibility is essential for diagnosing performance bottlenecks, such as high inter-token latency or slow retrieval, and for enforcing Service Level Objectives (SLOs).

Implementing tracing typically involves frameworks like OpenTelemetry (OTel), which provides a vendor-neutral standard for generating and exporting telemetry data. Spans are correlated using unique trace identifiers propagated across service boundaries. This data enables precise root cause analysis (RCA) by pinpointing whether latency originates in the model's prefill stage, an external API call, or an overloaded KV cache. When integrated with metrics and logs, traces provide a holistic view of system health, crucial for canary deployments and maintaining reliability in complex, agentic architectures.

THREE PILLARS OF OBSERVABILITY

Tracing vs. Metrics vs. Logs

A comparison of the three primary telemetry data types used to monitor and debug distributed LLM applications, highlighting their distinct purposes, data models, and analysis methods.

| Feature | Tracing | Metrics | Logs |
| --- | --- | --- | --- |
| Primary Purpose | Profiling end-to-end request flow and causality across services | Aggregated measurement of system performance and health over time | Recording discrete, timestamped events with contextual details |
| Data Model | Hierarchical tree of spans (operations) forming a trace | Time-series numeric values, often with dimensional labels (tags) | Unstructured or semi-structured text lines or structured events (e.g., JSON) |
| Temporal Scope | Follows a single logical request (high cardinality) | Sampled continuously across all requests (low cardinality) | Event-driven, triggered by specific occurrences (high cardinality) |
| Key Use Case in LLM Ops | Diagnosing high latency in a specific RAG pipeline step | Monitoring overall Tokens per Second (TPS) and error rates | Auditing a specific user prompt that triggered a safety filter |
| Analysis Method | Latency breakdown, dependency mapping, bottleneck identification | Trending, alerting, aggregation (sum, avg, percentiles) | Pattern searching, filtering, and forensic investigation |
| Cardinality | Extremely High (unique per request/trace) | Low to Medium (bounded set of named metrics) | Very High (unique per event) |
| Storage Volume | High (detailed per-request data), often sampled | Low (aggregated numbers) | Very High (raw event text) |
| Primary Tool Examples | Jaeger, Tempo, OpenTelemetry Traces | Prometheus, Datadog Metrics, OpenTelemetry Metrics | Loki, Elasticsearch, OpenTelemetry Logs |

DISTRIBUTED TRACING

Implementation Frameworks & Tools

Distributed tracing is implemented through a combination of instrumentation libraries, data collection agents, and visualization backends. These frameworks provide the necessary tooling to generate, propagate, collect, and analyze trace data across a complex LLM application stack.

02

Trace Visualization & Analysis

Once collected, trace data is visualized in specialized backends that transform raw spans into actionable insights:

  • Jaeger: Open-source, end-to-end distributed tracing system for complex microservice architectures. It provides a UI for visualizing trace waterfalls and analyzing latency bottlenecks.
  • Grafana Tempo: A high-volume, cost-effective trace storage backend that integrates tightly with Grafana, Loki, and Prometheus for correlated observability.
  • Commercial APMs: Tools like Datadog, New Relic, and Dynatrace offer integrated tracing with advanced analytics, service maps, and AI-powered anomaly detection for LLM pipelines.

These tools allow engineers to see the full journey of a user prompt through retrieval, model inference, and post-processing.
03

Span & Trace Anatomy

A trace is a directed acyclic graph of spans, each representing a named, timed operation. Key components include:

  • Trace ID: A globally unique identifier for the entire request journey.
  • Span ID: A unique identifier for a single operation within the trace.
  • Parent-Span Relationships: Defines the causal and temporal structure (e.g., an 'LLM Call' span is the parent of multiple 'Token Generation' spans).
  • Attributes: Key-value pairs storing metadata (e.g., model="gpt-4", input_tokens=1500, user_id="abc123").
  • Events: Timed annotations with a payload (e.g., "tool.called", "hallucination.detected").
  • Status: Success, error, or unset, often with an error message.
04

Instrumentation Patterns for LLMs

Effective tracing requires instrumenting key components of the LLM stack:

  • LLM Provider Wrappers: Auto-instrumentation or manual spans around calls to OpenAI, Anthropic, or self-hosted model endpoints, capturing model name, token counts, and latency.
  • Vector Database & Retrieval: Spans for embedding generation, similarity search, and context window assembly.
  • Tool/Function Calling: Spans that capture the execution of external APIs or code with inputs, outputs, and duration.
  • Orchestration Frameworks: Tracing within LangChain, LlamaIndex, or custom agents to visualize decision paths and iteration loops.
  • Business Logic & APIs: Traditional application spans that provide context for the LLM's role in a larger workflow.
05

Context Propagation

Context propagation is the mechanism that ensures a single trace ID flows through all services, including third-party LLM APIs. This is critical for end-to-end visibility.

  • W3C Trace Context: The standard HTTP header format (traceparent, tracestate) injected into outbound requests.
  • LLM Provider Support: Some providers allow passing metadata headers, which can be used to inject trace context for correlation in your backend.
  • Asynchronous Operations: Managing context across async/await boundaries and concurrent tasks using framework-specific context managers (e.g., contextvars in Python).

Without proper propagation, traces become fragmented, breaking the end-to-end view of the user request.
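
For the asynchronous case, the sketch below shows how Python's contextvars keeps a trace ID attached to the right request when several requests and tool calls run concurrently. The trace_id_var variable, the handler names, and the "trace-a" and "trace-b" IDs are hypothetical.

```python
import asyncio
import contextvars

trace_id_var = contextvars.ContextVar("trace_id", default=None)


async def call_tool(name):
    await asyncio.sleep(0)  # yield to the event loop, as a real I/O call would
    return name, trace_id_var.get()


async def handle_request(trace_id):
    trace_id_var.set(trace_id)
    # Tasks created by gather copy the current context, so each tool call
    # sees the trace ID of the request that spawned it.
    return await asyncio.gather(call_tool("search"), call_tool("calculator"))


async def main():
    # Each request runs in its own task, so setting the variable in one
    # request does not leak into the other.
    return await asyncio.gather(handle_request("trace-a"), handle_request("trace-b"))


results = asyncio.run(main())
```

Each inner list in results carries only the trace ID of its own request, even though the tool calls from both requests interleave on the same event loop.
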
06

Sampling & Tail-Based Decisions

Recording every trace is often prohibitively expensive. Sampling strategies control data volume:

  • Head-based Sampling: A decision to sample is made at the start of a trace (e.g., sample 10% of all requests). Simple but can miss rare, important errors.
  • Tail-based Sampling: All spans of a trace are buffered temporarily (typically in a collector) until the request completes. The sampling decision is then made based on the trace's characteristics (e.g., high latency, error status, specific endpoint), and only the selected traces are sent to the backend. This is more complex and requires buffering capacity, but it makes it much less likely that critical traces are dropped.
  • LLM-Specific Rules: Sampling can be configured to always capture traces for certain high-value users, experimental model versions, or prompts containing specific keywords.
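
The two strategies can be contrasted in a few lines. This is a simplified sketch: head_sample makes a deterministic decision from the trace ID alone, while tail_keep inspects a fully buffered trace. The function names, the span dictionary shape, and the 2000 ms threshold are assumptions for illustration.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any work is done.

    Uses the low 64 bits of a 32-hex-character trace ID, so every service
    that sees the same ID reaches the same decision.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def tail_keep(spans: list, latency_threshold_ms: float = 2000.0) -> bool:
    """Tail-based: decide after the whole trace has been buffered."""
    has_error = any(span["status"] == "error" for span in spans)
    root = next(span for span in spans if span["parent_id"] is None)
    return has_error or root["duration_ms"] > latency_threshold_ms


# Hypothetical buffered traces.
fast_ok = [{"parent_id": None, "status": "ok", "duration_ms": 300.0}]
slow_ok = [{"parent_id": None, "status": "ok", "duration_ms": 4500.0}]
fast_error = [
    {"parent_id": None, "status": "ok", "duration_ms": 250.0},
    {"parent_id": "root", "status": "error", "duration_ms": 20.0},
]

decisions = [tail_keep(fast_ok), tail_keep(slow_ok), tail_keep(fast_error)]
```

The head-based function never sees latency or status, which is why it can drop the one failing request in a thousand; the tail-based function keeps it at the cost of holding every span in memory until the trace ends.
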
DISTRIBUTED TRACING

Frequently Asked Questions

Distributed tracing is a critical observability method for profiling requests as they flow through a complex, microservices-based LLM application stack. These questions address its core mechanisms, implementation, and value for monitoring performance and diagnosing issues.

Distributed tracing is a method of observing and profiling requests as they flow through a distributed system by recording timing and metadata for individual operations across service boundaries. It works by instrumenting application code to generate traces, which are composed of nested spans. A trace represents the entire journey of a single request (e.g., a user query to an LLM). Each span represents a distinct unit of work within that request, such as a call to a vector database, the LLM inference itself, or a post-processing step. Spans are linked by a unique trace ID and contain metadata like start/end timestamps, operation names, and key-value attributes (tags). This structured data is collected by a tracing backend (like Jaeger or a vendor system) for visualization and analysis, creating a complete, timed graph of the request's path.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.