A service graph is a dynamic, topological model of a distributed system, automatically generated from distributed tracing data. It visualizes services as nodes and the request flows between them as directed edges, revealing runtime dependencies and communication patterns. This graph is a foundational component of observability, providing a real-time architectural map that is more accurate than static configuration files.
Service Graph

What is a Service Graph?
A service graph is a topological map derived from trace data that visually represents the services in a system and the directional request flows (dependencies) between them.
The graph is constructed by analyzing span data within traces, aggregating calls between services to infer dependency links. It enables critical operational workflows like impact analysis during outages, latency hotspot identification, and validating infrastructure-as-code changes. In agentic systems, it maps interactions between autonomous components and external tool calls, providing clarity in complex, dynamic execution environments.
Key Characteristics of a Service Graph
The following characteristics define a service graph's utility for system analysis.
Dynamic and Derived from Telemetry
A service graph is not a static configuration file. It is dynamically generated from observed telemetry data, primarily distributed traces. As new services are deployed or communication patterns change, the graph updates automatically to reflect the current system topology. This makes it an accurate, real-time representation of the actual runtime dependencies, which often differ from architectural diagrams or intended design.
Directed Graph Structure
The graph is a directed graph (digraph) where nodes represent services and edges represent directional request flows. An edge from Service A to Service B indicates that A calls B. This directionality is crucial for understanding dependency chains and performing root cause analysis. For example, if Service B is slow, the graph immediately shows which upstream services (like A) will be impacted.
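As a minimal sketch of how this directionality supports impact analysis, the snippet below walks the edges in reverse to collect every direct or transitive caller of a slow service. The service names and the `upstream_of` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical edges: (caller, callee) means the caller sends requests to the callee.
EDGES = [
    ("frontend", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "ledger-db"),
]

def upstream_of(edges, service):
    """Return every service that directly or transitively calls `service`."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)

    impacted, queue = set(), deque([service])
    while queue:  # breadth-first walk against the edge direction
        current = queue.popleft()
        for caller in callers[current]:
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted

print(sorted(upstream_of(EDGES, "payments")))  # ['checkout', 'frontend']
```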
Enriched with Performance Metadata
Nodes and edges are annotated with aggregated performance metrics derived from the underlying spans and traces. Key metadata includes:
- Request Rate (RPS) between services
- Error Rate (e.g., 4xx/5xx HTTP status codes)
- Latency percentiles (P50, P95, P99)
- Protocol information (gRPC, HTTP, messaging)

This enrichment transforms the graph from a simple map into a performance dashboard, allowing engineers to visually identify bottlenecks and anomalous behavior.
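The edge annotations listed above could be computed roughly as follows. This is a sketch under stated assumptions: each call is a hypothetical `(caller, callee, latency_ms, is_error)` tuple, the window length is supplied by the caller, and percentiles use the simple nearest-rank method rather than whatever estimator a given platform applies.

```python
import math
from collections import defaultdict

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct between 0 and 100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def edge_metrics(calls, window_seconds):
    """Aggregate (caller, callee, latency_ms, is_error) tuples per edge."""
    grouped = defaultdict(list)
    for caller, callee, latency_ms, is_error in calls:
        grouped[(caller, callee)].append((latency_ms, is_error))

    metrics = {}
    for edge, samples in grouped.items():
        latencies = sorted(latency for latency, _ in samples)
        errors = sum(1 for _, is_error in samples if is_error)
        metrics[edge] = {
            "rps": len(samples) / window_seconds,
            "error_rate": errors / len(samples),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "p99_ms": percentile(latencies, 99),
        }
    return metrics
```

Production systems usually keep histograms or sketches instead of raw latency lists, since storing every sample per edge does not scale.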
Reveals Implicit and Hidden Dependencies
Service graphs excel at uncovering implicit dependencies that are not documented or known to developers. This includes:
- Transitive dependencies: Services that are called deep in a chain.
- Shared backing services: Two unrelated services both depending on the same database or cache.
- Fan-out patterns: A single service calling many downstream services in parallel.
- Third-party API calls: External dependencies that impact system reliability.

This visibility is critical for impact analysis during incidents and change management.
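Two of these patterns, shared backing services and fan-out, can be read directly off the edge list. The sketch below assumes edges are plain `(caller, callee)` tuples; the function names and the fan-out threshold are illustrative choices, not part of any particular tool.

```python
from collections import defaultdict

def shared_backends(edges):
    """Map each callee used by more than one distinct caller to its callers."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return {callee: sorted(who) for callee, who in callers.items() if len(who) > 1}

def fan_out(edges, threshold):
    """Map each caller with at least `threshold` distinct callees to that count."""
    callees = defaultdict(set)
    for caller, callee in edges:
        callees[caller].add(callee)
    return {caller: len(targets) for caller, targets in callees.items()
            if len(targets) >= threshold}
```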
Foundation for Advanced Analytics
The graph's structured representation of service relationships enables sophisticated analytical workflows. It serves as the foundation for:
- Dependency analysis: Identifying critical services with many downstream dependents.
- Failure propagation modeling: Simulating how an outage in one node affects the system.
- Change risk assessment: Predicting the blast radius of a deployment.
- Capacity planning: Understanding traffic flow to right-size infrastructure.

In AI/agentic systems, it can model tool call dependencies and multi-agent communication pathways.
Integral to Observability Platforms
A service graph is rarely a standalone visualization. It is a core component of Application Performance Monitoring (APM) and observability platforms. It integrates deeply with other telemetry:
- Click-through to traces: Drilling down from a graph edge to view sample traces for that call path.
- Correlation with metrics and logs: Linking graph nodes to service-level dashboards and logs.
- Alerting integration: Configuring alerts based on dependency health (e.g., elevated error rate on a specific edge).

This creates a unified context for troubleshooting.
How is a Service Graph Generated?
A service graph is not manually defined but dynamically derived from telemetry data, providing a real-time map of system dependencies.
A service graph is generated by aggregating and analyzing distributed trace data over a defined time window. Observability platforms collect spans—each representing an operation within a service—and use the parent-child relationships and network metadata within them to infer which services communicate. This process identifies nodes (services) and edges (request flows), constructing a directed graph that visualizes dependencies and call patterns without static configuration.
The generation involves trace enrichment and aggregation logic, often within a trace pipeline or backend analytics engine. Spans are grouped by service names, and statistical analysis of span attributes like peer.service or HTTP host determines connection direction and strength. This automated derivation allows the graph to dynamically update as deployments change, providing an always-current topological view for root cause analysis and architectural oversight.
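The core of this derivation can be sketched in a few lines. The example assumes spans arrive as dictionaries with hypothetical `trace_id`, `span_id`, `parent_id`, and `service` fields, and that a whole time window fits in memory; real pipelines stream spans and must also handle parents that arrive late or never.

```python
from collections import Counter

# Hypothetical spans from two traces within one aggregation window.
SAMPLE_SPANS = [
    {"trace_id": "t1", "span_id": "1", "parent_id": None, "service": "frontend"},
    {"trace_id": "t1", "span_id": "2", "parent_id": "1", "service": "frontend"},
    {"trace_id": "t1", "span_id": "3", "parent_id": "2", "service": "checkout"},
    {"trace_id": "t2", "span_id": "1", "parent_id": None, "service": "frontend"},
    {"trace_id": "t2", "span_id": "2", "parent_id": "1", "service": "checkout"},
]

def build_edges(spans):
    """Count calls per (caller service, callee service) pair."""
    # Span IDs are only unique within a trace, so key on both IDs.
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get((span["trace_id"], span["parent_id"]))
        if parent is None:
            continue  # root span, or parent missing from this window
        if parent["service"] != span["service"]:
            # A parent-child link that crosses a service boundary is a call.
            edges[(parent["service"], span["service"])] += 1
    return edges

print(build_edges(SAMPLE_SPANS))
```

Parent-child links that stay inside one service are skipped, which is what collapses internal work into a single node.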
Service Graph vs. Related Concepts
This table compares the Service Graph, a topological map of service dependencies derived from trace data, against other key observability data models and visualizations.
| Feature / Aspect | Service Graph | Trace (DAG of Spans) | Flame Graph | Interaction Graph | Topology Map |
|---|---|---|---|---|---|
| Primary Data Source | Aggregated trace spans | Individual request spans | Profiling stack samples or trace spans | Agent message logs & state | Static configuration & service discovery |
| Representation Type | Dynamic topological graph | Temporal directed acyclic graph (DAG) | Hierarchical call stack | Dynamic communication network | Static inventory diagram |
| Core Purpose | Visualize service dependencies & health | Diagnose latency & errors in a single request | Identify performance hotspots in code paths | Model multi-agent communication & coordination | Document static infrastructure layout |
| Temporal Nature | Aggregated over time (e.g., 5-min intervals) | Real-time, single request lifetime | Snapshot of execution profile | Real-time or historical agent sessions | Static, changes only on deployment |
| Key Visual Elements | Nodes (services), Edges (request flows), Edge metrics (req/sec, p99 latency, error rate) | Spans (rectangles), Parent-child nesting, Timeline | Stack frames (rectangles), Width = duration or sample count | Nodes (agents), Edges (messages/triggers), Edge labels (intent) | Nodes (hosts, containers, services), Connectors |
| Derived Automatically | | | | | |
| Shows Aggregate Health (e.g., error rates) | | | | | |
| Shows Individual Request Detail | | | | | |
| Used For Root Cause Analysis | | | | | |
| Used For Capacity Planning | | | | | |
| Included in APM Tools | | | | | |
Primary Use Cases for Service Graphs
A service graph is not just a static diagram; it is a dynamic, data-driven model derived from trace telemetry. Its primary value lies in automating the discovery and analysis of service dependencies to solve critical operational challenges.
Root Cause Analysis & Impact Assessment
Service graphs enable rapid root cause isolation by visually mapping failure propagation. When a downstream service fails or experiences high latency, the graph immediately identifies all upstream dependent services that are potentially affected. This allows Site Reliability Engineers (SREs) to:
- Triage incidents by understanding the blast radius before alerts cascade.
- Perform impact assessment to prioritize remediation based on business-critical dependencies.
- Correlate metrics and logs from related services using the graph's topological context, moving from symptom (e.g., high error rate in Service A) to root cause (e.g., database timeout in Service D).
Architecture Validation & Drift Detection
Service graphs act as a ground-truth representation of the runtime architecture, which often diverges from static design documents. This is critical for validating deployment integrity and detecting architecture drift. Use cases include:
- Validating canary or blue-green deployments: Ensuring new service versions connect to the correct dependencies.
- Identifying shadow IT or rogue services: Discovering undocumented services communicating in the environment.
- Enforcing architectural guardrails: Detecting violations of intended communication patterns, such as a service directly querying a database it should not access.
- Documentation automation: Generating always-current architecture diagrams from live observability data.
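Drift detection of this kind reduces to comparing two edge sets. The sketch below assumes the intended architecture is available as an allow-list of `(caller, callee)` pairs; the `detect_drift` name and the result keys are hypothetical.

```python
def detect_drift(observed_edges, allowed_edges):
    """Compare edges seen at runtime against the intended architecture."""
    observed, allowed = set(observed_edges), set(allowed_edges)
    return {
        # Calls seen in traces that the design does not permit.
        "unexpected": sorted(observed - allowed),
        # Permitted calls that never appeared in the observed window.
        "unused": sorted(allowed - observed),
    }

allowed = [("api", "orders"), ("orders", "orders-db")]
observed = [("api", "orders"), ("api", "orders-db")]
print(detect_drift(observed, allowed))
```

An "unused" edge is not necessarily a problem; it may simply mean the call path saw no traffic in the window.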
Capacity Planning & Dependency Mapping
By analyzing the request volume and latency between nodes, service graphs provide quantitative data for informed capacity planning. Engineering leaders can:
- Identify critical bottlenecks and single points of failure in the dependency chain.
- Model the impact of scaling a specific service on its dependencies and upstream callers.
- Create accurate dependency maps for change management processes, answering 'What will break if we take Service X down?'
- Optimize resource allocation by understanding which service communications are most latency-sensitive or data-intensive.
Security & Compliance Auditing
The graph provides a continuous audit trail of allowed and actual service communications, which is foundational for a zero-trust security model. Security teams use it to:
- Enforce network security policies: Compare runtime communication paths against baseline policies to detect anomalies.
- Investigate security incidents: Trace the path of a potentially compromised request through the system.
- Support compliance audits: Provide evidence of controlled service interactions and data flow boundaries.
- Detect lateral movement potential: Visualize how an attacker could move from a breached service to other parts of the system based on real dependencies.
Performance Optimization & SLO Management
Service graphs contextualize performance metrics, enabling data-driven optimization. By overlaying golden signals like latency, error rate, and traffic on each edge, teams can:
- Pinpoint the source of latency degradation in a call chain, distinguishing between service processing time and network latency.
- Define and monitor Service Level Objectives (SLOs) for dependency health, moving beyond simple uptime to include performance of critical downstream services.
- Optimize call patterns: Identify and eliminate unnecessary serial calls or circular dependencies that degrade user experience.
- Conduct what-if analysis: Simulate the performance impact of restructuring service dependencies.
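Circular dependencies, mentioned above, can be found with a depth-first search over the edge list. This is one illustrative approach; `find_cycle` is a hypothetical helper, and the recursive form assumes graphs small enough to stay within Python's recursion limit.

```python
from collections import defaultdict

def find_cycle(edges):
    """Return one dependency cycle as a list of services, or None."""
    graph = defaultdict(list)
    for caller, callee in edges:
        graph[caller].append(callee)

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = defaultdict(int)  # every node starts as UNVISITED
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # Reached a node already on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNVISITED:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(graph):  # copy keys; visit() may add leaf nodes
        if state[node] == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None
```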
Onboarding & Operational Awareness
For new engineers or during incident response, a live service graph is an indispensable tool for building mental models of complex distributed systems. It provides:
- Immediate system comprehension without needing to consult outdated wikis.
- Context for alerts: An alert on a service is understood in relation to its dependencies and consumers.
- A shared visual language for discussing system topology across development, operations, and leadership teams.
- Faster onboarding by visually answering 'How does this system work?' and 'What depends on my service?'
Frequently Asked Questions
This FAQ addresses common questions about the construction, use, and role of a service graph in observability.
A service graph is a dynamic, topological map that visually represents the services within a distributed system and the directional request flows (dependencies) between them. It is not manually defined but is generated automatically by analyzing distributed trace data collected from instrumented applications. As traces flow through an observability backend, algorithms aggregate individual span data to infer service-level interactions. For each unique service identified by its name or host, nodes are created. Directed edges are then drawn between nodes based on the parent-child relationships observed in spans, where a span in Service A calls an operation in Service B. This process continuously updates the graph in near real-time, reflecting the current architecture and traffic patterns.
Related Terms
A service graph is a derived visualization, but it is built from and interacts with several core primitives and systems in the observability stack. These related terms define the components and processes that make service graph generation possible.
Distributed Tracing
Distributed tracing is the foundational methodology for observing requests as they propagate through a distributed system. It instruments and correlates work—represented as spans—across multiple services to understand performance and diagnose issues. A service graph is a topological aggregation of this trace data.
- Core Purpose: Provides a detailed, request-centric view of system behavior.
- Data Source: The raw span and trace data from which service graphs are computationally derived.
- Contrast: While a trace shows the lifecycle of a single request, a service graph shows the aggregate relationships between all services over many requests.
Span
A span is the fundamental unit of work in distributed tracing. It represents a named, timed operation corresponding to a contiguous segment of work within a service (e.g., a function call, database query, or HTTP request).
- Building Block: Spans are the atomic nodes from which traces are built and, by aggregation, service graphs are inferred.
- Context Carrier: Each span contains a span context (trace ID, span ID) that enables the reconstruction of call paths.
- Attributes: Spans carry key-value metadata (e.g., http.method, db.query) that is often used to label and filter dependencies in the service graph.
Trace
A trace is a collection of spans that represents the complete end-to-end path of a single request as it travels through a distributed system. The spans in a trace form a directed acyclic graph (DAG) that models the causal and temporal relationships between operations.
- Request View: A trace provides the vertical, detailed story of one transaction.
- Graph Input: Service graph algorithms analyze millions of traces to statistically identify stable service dependencies (e.g., Service A calls Service B in 95% of traces).
- Correlation Key: The trace ID glues all spans together, which is essential for backend systems to perform the aggregation that creates a service graph.
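The statistical view described above, the share of traces in which an edge appears, could be computed as in this sketch. It assumes cross-service calls have already been extracted as hypothetical `(trace_id, caller, callee)` tuples.

```python
from collections import defaultdict

def edge_trace_share(calls):
    """Return, per edge, the fraction of traces in which the edge appears."""
    all_traces = set()
    traces_per_edge = defaultdict(set)
    for trace_id, caller, callee in calls:
        all_traces.add(trace_id)
        # A set counts each trace once, even if the call repeats within it.
        traces_per_edge[(caller, callee)].add(trace_id)
    return {edge: len(ids) / len(all_traces)
            for edge, ids in traces_per_edge.items()}
```

An edge with a share near 1.0 is a stable dependency, while a very low share may indicate a rare code path or a sampling artifact.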
Span Kind
Span kind is a semantic attribute that classifies a span's role within a trace, such as CLIENT, SERVER, PRODUCER, CONSUMER, or INTERNAL. This classification is critical for service graph generation.
- Dependency Inference: Graph algorithms use span kinds to determine directionality. A CLIENT span in Service A and a corresponding SERVER span in Service B explicitly defines a dependency from A → B.
- Internal vs. External: INTERNAL spans represent work within a service boundary and are typically collapsed into the service node, while CLIENT/SERVER spans define the edges between nodes.
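A kind-based pairing rule can be sketched as follows. It assumes spans are dictionaries with hypothetical `span_id`, `parent_id`, `service`, and `kind` fields from a single trace, and it treats PRODUCER/CONSUMER pairs the same way as CLIENT/SERVER pairs, which is an illustrative simplification.

```python
# The kind a parent span must have for the child to mark an edge.
EXPECTED_PARENT_KIND = {"SERVER": "CLIENT", "CONSUMER": "PRODUCER"}

def edges_from_span_kinds(spans):
    """Derive (caller, callee) edges from matching parent/child span kinds."""
    by_id = {span["span_id"]: span for span in spans}
    edges = set()
    for span in spans:
        wanted = EXPECTED_PARENT_KIND.get(span["kind"])
        if wanted is None:
            continue  # INTERNAL, CLIENT, and PRODUCER spans never end an edge
        parent = by_id.get(span.get("parent_id"))
        if parent is not None and parent["kind"] == wanted:
            edges.add((parent["service"], span["service"]))
    return edges
```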
Propagator
A propagator is a component in a tracing library responsible for injecting trace context into outbound requests and extracting it from inbound requests. This maintains trace continuity across service boundaries, which is essential for building a connected service graph.
- Mechanism: Implements specific wire formats like W3C Trace Context or B3 Propagation using HTTP headers or messaging metadata.
- Graph Integrity: Without proper context propagation, spans become disconnected orphans, breaking the trace and preventing the service graph from accurately modeling dependencies between services.
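To make the mechanism concrete, here is a simplified sketch of inject and extract for the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags`. It handles only version `00`, ignores `tracestate`, and assumes lower-case header keys in a plain dictionary; the `inject` and `extract` functions are illustrative and are not the API of any tracing library.

```python
import re

# version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex flag chars
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def inject(headers, trace_id, span_id, sampled=True):
    """Write the current span's context into outbound request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers

def extract(headers):
    """Read the caller's context from inbound headers, or None if invalid."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }

outbound = inject({}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(extract(outbound))
```

When `extract` returns None, the receiving service starts a new trace, which is exactly the disconnected-span situation described above.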

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.