Glossary

Golden Signal

A golden signal is one of four fundamental metrics—latency, traffic, errors, and saturation—used in site reliability engineering (SRE) to comprehensively monitor the health and performance of a service.

SRE FOUNDATION

What is a Golden Signal?

A golden signal is one of four cardinal metrics (latency, traffic, errors, and saturation) that together give a high-level view of a service's health, largely from the user's perspective. The signals originate from Google's SRE practices and are considered 'golden' because they apply to nearly any user-facing service, are easy to interpret, and are usually enough to tell whether a service is healthy. For AI-powered services, they translate to model inference latency, request throughput, generation error rates, and GPU memory saturation.

In the context of AI SLO/SLI definition, golden signals form the empirical basis for Service Level Indicators (SLIs). Latency tracks Time To First Token (TTFT) and Time Per Output Token (TPOT). Traffic measures query volume. Errors capture failed inferences or hallucinations. Saturation monitors resource utilization like GPU memory. By instrumenting these four signals, engineering teams can define precise Service Level Objectives (SLOs) and manage error budgets to ensure reliable, user-centric AI service delivery.

SITE RELIABILITY ENGINEERING

The Four Golden Signals

The Four Golden Signals—latency, traffic, errors, and saturation—are the fundamental metrics defined by Google's Site Reliability Engineering (SRE) practice for comprehensively monitoring the health and performance of any service.

Latency

Latency measures the time it takes to service a request. For AI services, this is critical and often broken down into distinct phases:

Model Inference Latency: Total time from input submission to output generation.
Time To First Token (TTFT): For streaming LLM responses, the delay until the first token is emitted.
Time Per Output Token (TPOT): The average time between successive tokens after the first; its inverse is the per-request generation throughput in tokens per second.

Monitoring percentile latency (p50, p95, p99) is essential, because tail latency (p99) often dictates user-perceived performance. Requests slower than the SLO threshold count directly against a user-centric SLO.
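
The latency breakdown above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the timestamps, and the sample latencies are all hypothetical, and the percentile uses the simple nearest-rank method (production metric systems typically use histograms or sketches instead).

```python
import math

def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds for one streamed response.

    request_start: time the request was sent.
    token_times: arrival time of each output token, in ascending order.
    """
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, 0.0
    # TPOT: average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical data: one streamed response and ten request latencies (ms).
ttft, tpot = ttft_and_tpot(10.0, [10.4, 10.45, 10.5, 10.55, 10.6])
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 122, 900]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"TTFT={ttft:.2f}s TPOT={tpot:.3f}s p50={p50}ms p99={p99}ms")
```

In this sample the median looks healthy while a single slow request dominates p99, which is why the tail percentiles are tracked separately from the average.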

Traffic

Traffic quantifies the demand placed on your service. For AI systems, this is more nuanced than simple request counts.

Queries Per Second (QPS): The raw volume of inference requests.
Concurrent Users/Sessions: Number of simultaneous active interactions.
Input/Output Volume: Size of prompts and generated completions, which impacts computational load.

Understanding traffic patterns is required for capacity planning, auto-scaling, and correlating load with other signals like latency and error rates.
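
As a rough sketch of how these traffic measures might be derived from a request log, the example below computes QPS and token volume over a fixed window. The `RequestRecord` type, its field names, and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    timestamp: float      # seconds since the start of the window
    prompt_tokens: int    # input size
    output_tokens: int    # generated completion size

def traffic_summary(records, window_seconds):
    """Summarize demand over one window: request rate and token volume per second."""
    total_prompt = sum(r.prompt_tokens for r in records)
    total_output = sum(r.output_tokens for r in records)
    return {
        "qps": len(records) / window_seconds,
        "prompt_tokens_per_sec": total_prompt / window_seconds,
        "output_tokens_per_sec": total_output / window_seconds,
    }

# Hypothetical 10-second window with four inference requests.
records = [
    RequestRecord(0.5, 200, 50),
    RequestRecord(1.2, 1000, 300),
    RequestRecord(3.9, 100, 150),
    RequestRecord(7.0, 700, 100),
]
summary = traffic_summary(records, window_seconds=10)
print(summary)
```

Two services with the same QPS can carry very different loads if their prompt and completion sizes differ, so the token rates are reported alongside the request rate.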

Errors

Errors measure the rate of requests that fail. In AI services, failures extend beyond HTTP 5xx codes to include model-specific failures.

Service Errors: Failed API calls, timeouts, and infrastructure failures.
Model Errors: Structured output validation failures, context window overflows, or resource exhaustion.
Quality Errors: Outputs that violate SLOs for hallucination rate or answer faithfulness in RAG systems.

Tracking error rate against an error budget is fundamental to SRE's risk-management approach.
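
One possible way to track the three error categories above is to classify each request result and report a rate per category. The sketch below assumes a hypothetical result dictionary (`status`, `timed_out`, `schema_valid`, `context_overflow`, `faithfulness`) and an assumed faithfulness threshold of 0.8; real systems will have their own fields and thresholds.

```python
from collections import Counter

FAITHFULNESS_THRESHOLD = 0.8  # assumed quality cut-off, for illustration only

def classify(result):
    """Map one request result to an error category, or None if it succeeded."""
    if result["status"] >= 500 or result.get("timed_out", False):
        return "service"
    if not result.get("schema_valid", True) or result.get("context_overflow", False):
        return "model"
    if result.get("faithfulness", 1.0) < FAITHFULNESS_THRESHOLD:
        return "quality"
    return None

def error_rates(results):
    """Return the error rate per category plus the overall error rate."""
    counts = Counter(c for c in map(classify, results) if c is not None)
    total = len(results)
    rates = {kind: counts[kind] / total for kind in ("service", "model", "quality")}
    rates["overall"] = sum(counts.values()) / total
    return rates

# Hypothetical results: two successes and one failure of each kind.
results = [
    {"status": 200, "faithfulness": 0.95},
    {"status": 503},
    {"status": 200, "schema_valid": False},
    {"status": 200, "faithfulness": 0.5},
    {"status": 200},
]
rates = error_rates(results)
print(rates)
```

Each request is counted in at most one category, checked in order from infrastructure to quality, so the per-category rates sum to the overall rate.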

Saturation

Saturation measures how "full" your service is, that is, the utilization of its most constrained resources. Rising saturation signals approaching resource exhaustion before errors or latency spikes occur.

Hardware Metrics: GPU/CPU utilization, memory pressure, and I/O bandwidth.
Service-Specific Limits: Queue lengths, continuous batching efficiency in LLM servers, or token generation buffer saturation.
Derived Metrics: Scaling factor (demand/capacity).

Monitoring saturation provides the leading indicator needed for proactive scaling, which helps prevent tail latency amplification and cascading failures.
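
The demand/capacity ratio mentioned above could be computed per resource and compared against a headroom threshold, as in this small sketch. The resource names, readings, and the 0.85 threshold are illustrative assumptions, not recommended values.

```python
def saturation(demand, capacity):
    """Fraction of a resource currently in use (demand / capacity)."""
    return demand / capacity

def saturated_resources(readings, threshold=0.85):
    """Return the names of resources whose utilization is at or above the threshold."""
    return sorted(
        name
        for name, (used, limit) in readings.items()
        if saturation(used, limit) >= threshold
    )

# Hypothetical readings as (used, limit) pairs.
readings = {
    "gpu_memory_gb": (72.0, 80.0),    # 90% used
    "request_queue": (12, 100),       # 12% used
    "kv_cache_blocks": (1800, 2000),  # 90% used
}
alerts = saturated_resources(readings)
print(alerts)
```

Alerting on a threshold below 100% is what makes saturation a leading indicator: the warning fires while there is still headroom to scale.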

Applying Signals to AI Services

AI services require adapting the golden signals to model-specific behaviors. Key considerations include:

Defining SLIs/SLOs: An SLO for model inference latency or an SLO for hallucination rate translates golden signals into actionable reliability targets.
Agentic Systems: For autonomous agents, traffic may be tasks/hour, errors could be the agent task failure rate (the complement of the task success rate), and saturation might involve orchestration engine queue depth.
Observability Integration: These signals must feed into multi-window alerting based on burn rate to protect SLOs.

Beyond the Basics: AI-Specific Extensions

While the four signals are sufficient for most services, AI systems often require supplemental signals for full observability:

Quality & Correctness: Metrics like Retrieval Precision@K for RAG or instruction following accuracy.
Data Health: Data drift detection to monitor input distribution shifts.
Business Impact: An SLO that correlates a technical signal with a business metric, for example linking latency to user conversion.
Cost: SLO for cost efficiency (e.g., cost per inference) to balance performance with expenditure.

These extensions ensure monitoring captures both the operational and functional health of AI systems.

SLO/SLI DEFINITION FOR AI

Golden Signals for AI & ML Services

A core concept from Site Reliability Engineering (SRE) adapted for monitoring the health of artificial intelligence and machine learning services.

A golden signal is one of four fundamental metrics—latency, traffic, errors, and saturation—used to comprehensively monitor the health and performance of a service. Originating in Site Reliability Engineering (SRE), these signals provide a complete, high-level view of a system's behavior from a user's perspective, forming the empirical basis for defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

For AI services, these signals are specialized: latency includes Time To First Token (TTFT) and Time Per Output Token (TPOT); traffic measures queries per second (QPS); errors track failed inferences or hallucinations; and saturation monitors GPU memory and compute utilization. Monitoring these signals is essential for evaluation-driven development, enabling teams to quantify performance, notice regressions that may stem from issues such as data drift, and maintain reliable model inference that meets business-critical SLOs.

MONITORING METRICS

Traditional vs. AI Service Golden Signals

This table contrasts the four canonical Golden Signals used in traditional SRE with their adapted counterparts for monitoring AI-powered services, highlighting the shift from infrastructure health to model quality and reasoning integrity.

Golden Signal	Traditional Service (SRE Definition)	AI-Powered Service (Adapted Definition)	Primary Measurement Target
Latency	The time it takes to service a request.	The total time to produce a final, validated model output, including retrieval, inference, and any agentic reasoning steps.	End-to-end request duration (p95, p99)
Traffic	The demand/load placed on the service, measured in queries per second (QPS) or concurrent connections.	The rate and volume of inference requests, often segmented by model, endpoint, or user journey. Includes token throughput (Tokens/sec).	Requests Per Second (RPS), Token Throughput
Errors	The rate of failed requests: explicit failures such as HTTP 5xx responses, plus responses that succeed at the protocol level but return wrong content or miss a policy threshold.	The rate of requests where the output fails quality or correctness checks, including hallucinations, safety violations, context overflows, and agentic execution failures.	Error Rate (Failed Requests / Total Requests)
Saturation	The utilization of a service's constrained resources (e.g., CPU, memory, I/O).	The utilization of constrained, scalable resources critical for AI performance, primarily GPU/accelerator memory and compute. Measures 'headroom' before quality degrades.	GPU Memory Utilization %, KV Cache Pressure

GOLDEN SIGNAL

Frequently Asked Questions

A golden signal is one of four cardinal metrics (latency, traffic, errors, and saturation) collectively used to monitor the health and performance of a service, providing a high-level view of its operational state. Originating from Google's Site Reliability Engineering (SRE) practices, these signals are considered "golden" because they let a team understand the user experience and system behavior without being overwhelmed by data. Latency measures the time to service a request. Traffic quantifies demand (e.g., requests per second). Errors track the rate of failed requests. Saturation indicates how "full" a resource is, like CPU or memory utilization. For AI services, these translate directly to metrics like model inference latency, queries per second (QPS), model error rates, and GPU memory saturation.

SLO/SLI DEFINITION FOR AI

Related Terms

The Golden Signal framework provides a foundation for monitoring. These related concepts define the specific targets, agreements, and operational practices built upon that foundation.

Service Level Objective (SLO)

A Service Level Objective (SLO) is a quantitative target for the reliability, performance, or quality of a service. It is derived from one or more Service Level Indicators (SLIs) and defines the acceptable level of service over a specific time window.

Core Function: Translates raw metrics (SLIs) into a business-agreed target (e.g., "99.9% of requests must have latency < 200ms this quarter").
AI Application: For AI services, SLOs can target model inference latency, hallucination rate, or retrieval precision, moving beyond traditional infrastructure metrics.
Key Property: SLOs define "how good is good enough" and create an error budget for managing risk.
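
To make the example target concrete, the sketch below checks a latency SLO of the form "99.9% of requests must have latency < 200ms". It is a simplified illustration: the function names and the synthetic latency samples are assumptions, and a real check would run over the full SLO window rather than an in-memory list.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests that finished faster than the threshold."""
    good = sum(1 for value in latencies_ms if value < threshold_ms)
    return good / len(latencies_ms)

def meets_slo(sli, target):
    """SLO check: the measured SLI must be at or above the target."""
    return sli >= target

# Hypothetical windows of 10,000 requests each.
healthy = [150] * 9995 + [450] * 5     # 99.95% under 200 ms
degraded = [150] * 9980 + [450] * 20   # 99.80% under 200 ms

healthy_sli = latency_sli(healthy, threshold_ms=200)
degraded_sli = latency_sli(degraded, threshold_ms=200)
print(healthy_sli, meets_slo(healthy_sli, target=0.999))
print(degraded_sli, meets_slo(degraded_sli, target=0.999))
```

The SLI is the measurement and the SLO is the comparison against a target, which is why the two are kept as separate functions here.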

Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a directly measurable metric that quantifies a specific aspect of a service's performance or reliability. SLIs are the raw measurements that feed into Service Level Objectives (SLOs).

Golden Signal Examples: Latency (p95, p99), error rate (5xx responses), traffic (queries per second), saturation (GPU utilization %).
AI-Specific SLIs: Include Time To First Token (TTFT), Time Per Output Token (TPOT), and answer faithfulness score.
Measurement: SLIs must be measured from the user's perspective wherever possible to reflect true experience.

Error Budget

An error budget is the explicit, quantified amount of unreliability a service team is allowed within a Service Level Objective's (SLO) time window. It is calculated as 100% - SLO Target.

Purpose: Serves as a shared resource for balancing velocity and stability. Spending the budget on launches is acceptable; exhausting it triggers a focus on reliability.
Management: Teams track burn rate—the speed at which the error budget is consumed—to alert on sustained degradation versus brief spikes.
Cultural Tool: Transforms reliability from an abstract goal into a concrete, spendable resource for product and engineering decisions.
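
The 100% - SLO Target formula can be worked through numerically. The following sketch assumes a 99.9% SLO, a 30-day window, and one million requests in that window; the function names and figures are hypothetical and chosen only to illustrate the arithmetic.

```python
def error_budget_fraction(slo_target):
    """Error budget as a fraction of all events: 1 - SLO target."""
    return 1 - slo_target

def allowed_failures(slo_target, total_requests):
    """Number of failed requests the budget permits in the window."""
    return error_budget_fraction(slo_target) * total_requests

def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1 - failed_requests / allowed_failures(slo_target, total_requests)

slo = 0.999
allowed = allowed_failures(slo, total_requests=1_000_000)   # about 1,000 requests
remaining = budget_remaining(slo, 1_000_000, failed_requests=250)  # about 75%

# For a time-based SLO, the same fraction applies to the window length.
window_minutes = 30 * 24 * 60
downtime_budget_minutes = error_budget_fraction(slo) * window_minutes  # about 43.2

print(round(allowed), round(remaining, 2), round(downtime_budget_minutes, 1))
```

Expressed this way, a 99.9% target over 30 days leaves roughly 43 minutes of full unavailability, or about one failed request per thousand, to spend on launches and incidents.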

Service Level Agreement (SLA)

A Service Level Agreement (SLA) is a formal contract between a service provider and a customer that defines the minimum expected level of service, often including financial penalties or remedies if commitments are not met.

Relationship to SLOs: An SLA is typically an external business contract, while SLOs are internal engineering targets. SLOs are often set stricter than the SLA to provide a safety margin.
Enforceability: SLAs involve legal and commercial consequences (e.g., service credits), whereas missed SLOs trigger internal operational reviews.
AI Context: For AI-as-a-Service offerings, SLAs may cover uptime, throughput, and specific quality metrics like maximum hallucination rates.

Burn Rate

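Burn rate quantifies the speed at which a service consumes its error budget. It can be expressed as the percentage of the total error budget consumed per unit of time (e.g., 10% per hour), or as a multiple of the steady rate that would use up the budget exactly at the end of the SLO window (a burn rate of 1).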

Alerting Strategy: Forms the basis for multi-window alerting. A high burn rate for a short period may indicate a transient spike, while a moderate burn rate over a long period signals sustained degradation likely to violate the SLO.
Proactive Management: Allows teams to respond to reliability issues before the error budget is fully exhausted and the SLO is breached.
Visualization: Often plotted on a graph showing budget remaining over time, with trend lines predicting time until exhaustion.
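
A burn-rate alert might be sketched as follows. The example treats burn rate as the observed error rate divided by the error budget fraction, and pages only when both a long and a short window exceed a threshold. The 14.4 threshold corresponds to spending 2% of a 30-day budget in one hour; the function names, the window choice, and the sample error rates are assumptions for illustration.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo_target)

def budget_consumed(rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the total error budget used if `rate` is sustained for window_hours."""
    return rate * window_hours / slo_period_hours

def should_page(long_window_error_rate, short_window_error_rate,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast: sustained and still happening."""
    return (
        burn_rate(long_window_error_rate, slo_target) >= threshold
        and burn_rate(short_window_error_rate, slo_target) >= threshold
    )

slo = 0.999
rate = burn_rate(0.02, slo)              # 2% errors against a 0.1% budget: about 20x
used_in_one_hour = budget_consumed(14.4, window_hours=1)  # about 2% of the budget

ongoing = should_page(0.02, 0.03, slo)       # both windows hot
recovered = should_page(0.02, 0.0005, slo)   # short window has already recovered
print(round(rate, 1), round(used_in_one_hour, 3), ongoing, recovered)
```

Requiring the short window to agree with the long one keeps the alert from firing after an incident has already ended, while the long window filters out brief spikes.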

Critical User Journey (CUJ)

A Critical User Journey (CUJ) is a specific, high-value sequence of user interactions that is essential to the user's success with a service. SLOs and SLIs should be defined to protect these journeys.

Foundation for SLOs: Instead of monitoring every API endpoint, teams identify 5-10 key CUJs (e.g., "User submits a query and receives a summarized answer") and instrument their SLIs.
AI Example: For a RAG system, a CUJ could be: User asks a question → System retrieves 3 relevant documents → LLM synthesizes a grounded answer. Latency and error SLIs would track this entire flow.
User-Centric Design: Ensures monitoring focuses on what matters to the business and end-user, not just internal system health.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
