Inferensys

Glossary

Trace

A trace is a collection of spans that represents the end-to-end path of a request as it propagates through a distributed system, forming a directed acyclic graph (DAG) of operations.
DISTRIBUTED TRACE COLLECTION

What is a Trace?

A trace is the fundamental data structure for observing request flow in distributed systems, particularly critical for monitoring autonomous agents.

A trace is a collection of spans that represents the complete end-to-end path of a single request as it propagates through a distributed system, forming a directed acyclic graph (DAG) of operations. In agentic systems, a trace captures the entire execution journey, from the initial user prompt or trigger, through internal reasoning steps and tool calls, to the final action or response. It is uniquely identified by a Trace ID that correlates all work across service and process boundaries.

Traces are composed of hierarchically nested spans, where parent-child relationships define the flow of execution and causality. This structure allows engineers to reconstruct the precise sequence of events, identify performance bottlenecks (latency), and diagnose failures in complex, multi-service workflows. For autonomous agents, traces are essential for auditing behavior, checking whether a run can be reproduced, and understanding the agent's decision-making process, because they provide a complete, time-ordered record of its internal state changes and external interactions.

STRUCTURAL ELEMENTS

Key Components of a Trace

A trace is a directed acyclic graph (DAG) composed of interconnected spans. Understanding its core components is essential for analyzing system performance and diagnosing failures in distributed architectures.

01

The Span

A span is the fundamental building block of a trace, representing a single, named, and timed operation within a service. It captures the execution of a contiguous unit of work, such as:

  • A function call or method execution
  • An HTTP request to an external API
  • A database query or transaction

Each span contains a start time, duration, status code (e.g., OK, ERROR), and a set of attributes for metadata. Spans are linked via parent-child relationships to form the trace's hierarchical structure.
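
To make these fields concrete, here is a minimal sketch of a span record. The `Span` class and its field names are hypothetical and chosen for illustration; they do not reproduce any particular SDK's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical minimal span record; not any SDK's actual class."""
    name: str
    span_id: str
    parent_span_id: Optional[str]  # None marks a root span
    start_ns: int                  # start time in nanoseconds
    end_ns: int                    # end time in nanoseconds
    status: str = "UNSET"          # e.g. "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        # Duration is derived from the two timestamps
        return (self.end_ns - self.start_ns) / 1_000_000


# A root span for an HTTP request and one child span for a database query
root = Span("GET /orders", "a1", None, start_ns=0, end_ns=120_000_000, status="OK")
query = Span("SELECT orders", "b2", "a1", start_ns=10_000_000, end_ns=55_000_000,
             status="OK", attributes={"db.system": "postgresql"})
```

The child points at its parent through `parent_span_id`, which is all that is needed to rebuild the hierarchy later.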

02

Trace & Span Identifiers

Unique identifiers are crucial for correlating data across a distributed system.

  • Trace ID: A globally unique, immutable identifier (typically a 16-byte array) assigned to the entire request. Every span in the same trace shares this ID, enabling end-to-end correlation.
  • Span ID: A unique identifier (8 bytes in W3C Trace Context) for an individual span within a trace. It is used to establish the parent-child links between spans.
  • Parent Span ID: The identifier of the span that directly caused the current span's work. A span without a parent ID is a root span, representing the initial operation of the trace.
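
As a rough illustration, the identifiers can be generated from random bytes. This sketch assumes the W3C Trace Context sizes (16-byte trace IDs, 8-byte span IDs) rendered as lowercase hex; the helper names are hypothetical.

```python
import secrets
from typing import Optional


def new_trace_id() -> str:
    # 16 random bytes, rendered as 32 lowercase hex characters
    return secrets.token_hex(16)


def new_span_id() -> str:
    # 8 random bytes, rendered as 16 lowercase hex characters
    return secrets.token_hex(8)


def is_root(parent_span_id: Optional[str]) -> bool:
    # A span with no parent ID is the root span of its trace
    return parent_span_id is None


trace_id = new_trace_id()
root_id = new_span_id()
# Every span in the trace shares trace_id; the child records root_id as its parent
child = {"trace_id": trace_id, "span_id": new_span_id(), "parent_span_id": root_id}
```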

03

Span Context & Propagation

Span context is the immutable tracing state that must be propagated across process and service boundaries to maintain trace continuity. It contains the trace ID, span ID, trace flags (e.g., sampling decision), and trace state (vendor-specific data).

Distributed context propagation is the mechanism for passing this context, typically via:

  • HTTP headers (using standards like W3C Trace Context or B3 Propagation)
  • Messaging and RPC metadata (e.g., Kafka record headers, gRPC metadata)

A propagator component in the tracing SDK handles the injection (outbound) and extraction (inbound) of this context.
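
The injection and extraction steps can be sketched for the W3C `traceparent` header, whose version-00 format is `00-<trace-id>-<parent-id>-<trace-flags>`. The `SpanContext`, `inject`, and `extract` names below are hypothetical stand-ins for what a real propagator does, and the sketch ignores `tracestate` and future header versions.

```python
import re
from typing import NamedTuple, Optional

# Version 00 only: 32 hex chars of trace ID, 16 of span ID, 2 of flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool  # lowest bit of the trace-flags byte


def inject(ctx: SpanContext, headers: dict) -> None:
    """Outbound: write the context into the request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: dict) -> Optional[SpanContext]:
    """Inbound: read the context back, or return None if absent or malformed."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in W3C Trace Context
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = SpanContext("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)
outgoing = {}
inject(ctx, outgoing)
```

A receiving service would call `extract` on the incoming headers and use the returned span ID as the parent of its own first span.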

04

Span Attributes & Events

These components add rich, queryable metadata to a span.

  • Attributes: Key-value pairs that describe the operation. Examples include http.method="GET", db.statement="SELECT * FROM users", or custom business data like user.id="12345".
  • Events: Timed, structured logs attached to a span that represent singular occurrences during its lifetime, such as an exception being thrown, a cache miss, or a significant state change. Each event has a name, timestamp, and its own set of attributes.
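
A small sketch of how attributes and events might hang off a span. The classes and the `spans_where` helper are hypothetical; the attribute keys reuse the examples above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    timestamp_ns: int
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def set_attribute(self, key: str, value) -> None:
        self.attributes[key] = value

    def add_event(self, name: str, **attributes) -> None:
        # Each event is timestamped at the moment it is recorded
        self.events.append(Event(name, time.time_ns(), attributes))


def spans_where(spans, key, value):
    """Attributes make spans queryable: filter by one key-value pair."""
    return [s for s in spans if s.attributes.get(key) == value]


span = Span("load_user_profile")
span.set_attribute("http.method", "GET")
span.set_attribute("user.id", "12345")
span.add_event("cache.miss", key="user:12345")
```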

05

Span Links & Span Kind

These elements define semantic relationships and roles.

  • Span Links: A reference from one span to another span, in the same trace or a different one. They model causal relationships that are not parent-child, such as a batch job processing an item that originated from an asynchronous queue.
  • Span Kind: A classification specifying the span's role in the trace topology. Core kinds include:
    • Server: For the receiver of a remote operation (e.g., an HTTP server handler).
    • Client: For the initiator of a remote operation (e.g., an outgoing HTTP call).
    • Internal: For operations within the application boundary with no remote context.
    • Producer/Consumer: For messaging systems.
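
The batch-job case can be sketched as follows: two producers publish in separate traces, and a single consumer span in a third trace links back to both. All class names and ID strings are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class SpanKind(Enum):
    INTERNAL = "internal"
    SERVER = "server"
    CLIENT = "client"
    PRODUCER = "producer"
    CONSUMER = "consumer"


@dataclass(frozen=True)
class Link:
    """Reference to another span, identified by its trace ID and span ID."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    kind: SpanKind = SpanKind.INTERNAL
    links: list = field(default_factory=list)


# Two producers, each in its own trace, enqueue one item
publish_a = Span("publish order", "trace-a", "span-a1", SpanKind.PRODUCER)
publish_b = Span("publish order", "trace-b", "span-b1", SpanKind.PRODUCER)

# One batch job consumes both items in a third trace and links to their origins
batch = Span("process batch", "trace-c", "span-c1", SpanKind.CONSUMER,
             links=[Link(s.trace_id, s.span_id) for s in (publish_a, publish_b)])
```

A parent-child edge could not express this, because the batch span would need two parents in two different traces.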

06

Trace as a Directed Acyclic Graph (DAG)

A complete trace is a collection of spans that forms a Directed Acyclic Graph (DAG), not merely a linear chain. This structure emerges because:

  • A single parent span can have multiple concurrent child spans (e.g., fan-out API calls).
  • Span links add causal edges that are not parent-child, including edges to spans in other traces.

Tools visualize this structure as flame graphs (showing nested durations) and service graphs (showing inter-service dependencies). The root span is the graph's entry point, and its duration typically gives the request's end-to-end latency.
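
A minimal sketch of rebuilding the graph from parent pointers, using made-up sample spans with a fan-out under the root. It follows only parent-child edges; span links are left out for brevity.

```python
from collections import defaultdict

# Hypothetical sample data: (span_id, parent_span_id, start_ms, end_ms)
spans = [
    ("root", None, 0, 100),
    ("auth", "root", 5, 20),
    ("inventory", "root", 25, 70),  # runs concurrently with "pricing"
    ("pricing", "root", 25, 90),
    ("db", "inventory", 30, 60),
]

# Invert the parent pointers into a parent -> children adjacency map
children = defaultdict(list)
for span_id, parent_id, _start, _end in spans:
    children[parent_id].append(span_id)

roots = children[None]


def walk(span_id):
    """Depth-first traversal following parent -> child edges."""
    order = [span_id]
    for child in children[span_id]:
        order.extend(walk(child))
    return order


visit_order = walk(roots[0])

# Spans whose parent never arrived (or that sit in a corrupt parent cycle)
# cannot be reached from the root
orphans = {s[0] for s in spans} - set(visit_order)

# Wall-clock extent of the trace; here it equals the root span's duration
total_latency_ms = max(s[3] for s in spans) - min(s[2] for s in spans)
```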

MECHANISM

How Distributed Tracing Works

Distributed tracing is a diagnostic technique that reconstructs the complete lifecycle of a single request as it traverses a complex, multi-service architecture.

A trace is a directed acyclic graph (DAG) of spans, where each span represents a discrete unit of work within a service, such as a database query or an API call. A trace begins when the first instrumented service to handle a request generates a globally unique Trace ID and creates the root span. This context, containing the Trace ID and the current Span ID, is then propagated, typically via HTTP headers such as those defined in W3C Trace Context, to every downstream service called during the request's execution.

Each instrumented service uses the propagated context to create child spans, recording the caller's Span ID as their parent, thereby building the complete graph. As spans finish, they are exported, often through an OpenTelemetry Collector, and the tracing backend assembles them by their shared Trace ID. Tools like Jaeger or Zipkin visualize the reconstructed timeline as a waterfall or flame graph, enabling engineers to pinpoint latency bottlenecks, failed services, and unexpected execution paths across the entire distributed system.
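
The assembly step can be sketched as a group-by on the trace ID. The record layout and the `assemble` function are assumptions for illustration; real backends also handle late-arriving spans, clock skew, and storage.

```python
from collections import defaultdict

# Hypothetical span records as a backend might receive them:
# out of order and interleaved across traces
exported = [
    {"trace_id": "t1", "span_id": "s2", "parent": "s1", "service": "orders", "start": 12},
    {"trace_id": "t2", "span_id": "s9", "parent": None, "service": "gateway", "start": 3},
    {"trace_id": "t1", "span_id": "s1", "parent": None, "service": "gateway", "start": 10},
    {"trace_id": "t1", "span_id": "s3", "parent": "s2", "service": "db", "start": 15},
]


def assemble(records):
    """Group spans by trace ID, then order each trace by start time."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    for spans in traces.values():
        spans.sort(key=lambda r: r["start"])
    return dict(traces)


traces = assemble(exported)
timeline = [(s["service"], s["span_id"]) for s in traces["t1"]]
```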

DISTRIBUTED TRACE COLLECTION

Traces in Agentic Observability

In agentic systems, a trace is the definitive record of an autonomous agent's execution path, capturing its internal reasoning, external tool calls, and state changes as a directed acyclic graph (DAG) of operations.

01

Core Definition & Structure

A trace is a collection of spans that represents the end-to-end path of a single request or agent execution as it propagates through a distributed system. It forms a directed acyclic graph (DAG) where:

  • Each span is a named, timed operation representing a unit of work.
  • Parent-child relationships define the flow and hierarchy of operations.
  • The root span initiates the trace, with subsequent spans attached as children or through follows-from links.

This structure is essential for visualizing the complete lifecycle of an agent's task, from initial prompt to final action.

02

The Span: Fundamental Building Block

A span is the atomic unit of a trace. For agentic observability, spans capture distinct phases of agentic work:

  • Internal Reasoning: A span for a planning cycle, chain-of-thought, or reflection step.
  • Tool/API Execution: A span for an external function call, database query, or API request, including duration and success status.
  • Memory Operations: Spans for reading from or writing to a vector store or knowledge graph.

Each span contains critical metadata: a span ID, parent span ID, start/end timestamps, span kind (e.g., INTERNAL, CLIENT), and attributes (key-value pairs detailing the operation).

03

Context Propagation Across Boundaries

For a trace to be truly distributed, span context must propagate across process and network boundaries. This is the mechanism that ties an agent's internal reasoning to its external API calls.

  • Trace ID & Span ID: A globally unique Trace ID identifies the entire execution. The current Span ID identifies the specific operation.
  • Propagation Standards: Context is carried via headers using standards like W3C Trace Context or B3 Propagation.
  • Agentic Specifics: When an agent calls a tool, the SDK injects the current span context into the HTTP request. The tool's service extracts it, creating a child span, thereby extending the trace into the external service.

04

Enrichment for Agentic Understanding

Raw spans are useful for timing; enriched spans are critical for auditing and debugging agent behavior. Trace enrichment adds semantic, business, and agent-specific context.

  • Agent State: Attach the current goal, plan step, or conversation turn as span attributes.
  • Tool Call Details: Enrich spans with the exact function name, parameters, and parsed results.
  • Cost Telemetry: Add attributes for LLM token usage, model name, and API call cost.
  • Business Context: Include user ID, session ID, or transaction ID to link technical traces to business outcomes.
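
A sketch of enrichment for an LLM call span. The attribute keys (`agent.plan_step`, `llm.*`), the `enrich_llm_span` helper, and the per-1k-token prices are all made up for illustration and do not follow any official naming convention or price list.

```python
def enrich_llm_span(attributes, *, model, prompt_tokens, completion_tokens,
                    usd_per_1k_prompt, usd_per_1k_completion, user_id, plan_step):
    """Return a copy of `attributes` with agent, cost, and business context added."""
    cost = ((prompt_tokens / 1000) * usd_per_1k_prompt
            + (completion_tokens / 1000) * usd_per_1k_completion)
    return {
        **attributes,
        "agent.plan_step": plan_step,               # agent state
        "llm.model": model,                         # cost telemetry
        "llm.prompt_tokens": prompt_tokens,
        "llm.completion_tokens": completion_tokens,
        "llm.cost_usd": round(cost, 6),
        "user.id": user_id,                         # business context
    }


raw = {"span.name": "llm.generate"}
enriched = enrich_llm_span(
    raw,
    model="example-model",       # placeholder model name
    prompt_tokens=1000,
    completion_tokens=500,
    usd_per_1k_prompt=0.5,       # placeholder price, not a real rate
    usd_per_1k_completion=1.5,   # placeholder price, not a real rate
    user_id="12345",
    plan_step=2,
)
```

Returning a new dictionary leaves the raw span attributes untouched, which makes it easy to compare timing-only and enriched views of the same span.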

05

Visualization: Flame Graphs & Service Graphs

Traces are visualized to diagnose performance and understand flow.

  • Flame Graph: A hierarchical visualization where the width of a horizontal bar represents a span's duration. It makes the critical path visible and shows which internal reasoning step or tool call contributed the most latency.
  • Service Graph (or Dependency Map): A topological map automatically generated from trace data. For multi-agent systems, it shows agents as nodes and their communication (RPC, messages) as edges, revealing the interaction network and upstream/downstream dependencies.
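
Both views can be derived from the same span data. The sketch below uses hypothetical spans from a two-agent system: `service_edges` counts calls that cross a service boundary, and `render` draws a crude text timeline in the spirit of a flame graph.

```python
from collections import Counter

# Hypothetical spans: (span_id, parent_id, service, start_ms, end_ms)
spans = [
    ("1", None, "planner-agent", 0, 40),
    ("2", "1", "planner-agent", 2, 10),
    ("3", "1", "search-agent", 12, 30),
    ("4", "3", "vector-db", 14, 26),
    ("5", "1", "search-agent", 31, 38),
]


def service_edges(spans):
    """Count parent -> child calls that cross from one service to another."""
    by_id = {s[0]: s for s in spans}
    edges = Counter()
    for _id, parent_id, service, _start, _end in spans:
        parent = by_id.get(parent_id)
        if parent is not None and parent[2] != service:
            edges[(parent[2], service)] += 1
    return edges


def render(spans, ms_per_char=2):
    """One row per span: offset by start time, bar width proportional to duration."""
    by_id = {s[0]: s for s in spans}

    def depth(span_id):
        parent_id = by_id[span_id][1]
        return 0 if parent_id is None else 1 + depth(parent_id)

    rows = []
    for span_id, _parent, service, start, end in sorted(spans, key=lambda s: s[3]):
        bar = " " * (start // ms_per_char) + "#" * max(1, (end - start) // ms_per_char)
        rows.append(f"{bar:<22}{'  ' * depth(span_id)}{service}")
    return "\n".join(rows)


edges = service_edges(spans)
chart = render(spans)
```

Printing `chart` gives one bar per span, with the label indented by nesting depth.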

06

Sampling & Data Volume Management

Capturing every trace is often prohibitively expensive. Trace sampling strategically reduces volume.

  • Head Sampling: Decision made at the start of a request (e.g., sample 10% of all agent sessions). Simple but may miss rare, important traces.
  • Tail Sampling: Decision made after the trace is complete, based on its full content. Crucial for agents, as it allows rules like:
    • Sample 100% of traces with errors (e.g., tool call failure).
    • Sample 100% of traces where final answer confidence < threshold.
    • Sample traces with latency > 2s.

This ensures high-value agent executions are always retained for analysis.
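
The two strategies can be sketched side by side. The function names, the span dictionary layout, the `agent.answer_confidence` attribute, and the 0.7 confidence threshold are assumptions for illustration; the 2-second latency rule mirrors the example above.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head sampling: decide from the trace ID alone, before any work happens.

    Every service computing this for the same trace ID reaches the same decision.
    """
    return int(trace_id[-16:], 16) < rate * 2**64


def tail_sample(spans, *, max_latency_ms=2000, min_confidence=0.7) -> bool:
    """Tail sampling: decide after the trace is complete, using its full content."""
    has_error = any(s.get("status") == "ERROR" for s in spans)
    root = next(s for s in spans if s.get("parent_span_id") is None)
    latency_ms = root["end_ms"] - root["start_ms"]
    confidence = root.get("attributes", {}).get("agent.answer_confidence")
    low_confidence = confidence is not None and confidence < min_confidence
    return has_error or latency_ms > max_latency_ms or low_confidence


def make_trace(latency_ms, confidence, tool_status="OK"):
    # Hypothetical two-span trace: an agent run and one tool call
    return [
        {"parent_span_id": None, "start_ms": 0, "end_ms": latency_ms, "status": "OK",
         "attributes": {"agent.answer_confidence": confidence}},
        {"parent_span_id": "root", "start_ms": 10, "end_ms": 50, "status": tool_status},
    ]


fast_ok = make_trace(500, 0.95)
tool_failed = make_trace(500, 0.95, tool_status="ERROR")
slow = make_trace(3500, 0.95)
unsure = make_trace(500, 0.40)
```

Head sampling is cheap because it needs only the trace ID, while tail sampling has to buffer every span of a trace until the decision can be made.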

TRACE

Frequently Asked Questions

A trace is the foundational data structure for understanding request flow in distributed systems. These questions address its core mechanics, implementation, and value for observability.

A trace is a collection of spans that represents the complete, end-to-end path of a single logical request as it propagates through a distributed system, forming a directed acyclic graph (DAG) of operations. It provides a holistic view of the request's journey, capturing the causal relationships and timing between all participating services, databases, and external APIs. In agentic systems, a trace visualizes the entire cognitive workflow—from initial user prompt, through planning and tool execution, to final response—enabling engineers to audit autonomy and pinpoint latency bottlenecks.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.