Inferensys

Glossary

Golden Signal

A golden signal is one of four fundamental metrics—latency, traffic, errors, and saturation—used in site reliability engineering (SRE) to comprehensively monitor the health and performance of a service.
SRE FOUNDATION

What is a Golden Signal?

A golden signal is one of four cardinal metrics (latency, traffic, errors, and saturation) that together give a high-level view of a service's health from the user's perspective. Originating from Google's SRE practices, these signals are considered 'golden' because they apply to almost any user-facing service, are easy to interpret, and are usually enough to tell whether that service is healthy. For AI-powered services, this translates to monitoring model inference latency, request throughput, generation error rates, and GPU memory saturation.

In the context of AI SLO/SLI definition, golden signals form the empirical basis for Service Level Indicators (SLIs). Latency tracks Time To First Token (TTFT) and Time Per Output Token (TPOT). Traffic measures query volume. Errors capture failed inferences or hallucinations. Saturation monitors resource utilization like GPU memory. By instrumenting these four signals, engineering teams can define precise Service Level Objectives (SLOs) and manage error budgets to ensure reliable, user-centric AI service delivery.
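
As a rough illustration of how a golden signal becomes an SLI and is checked against an SLO, the sketch below computes a latency SLI as the fraction of requests whose TTFT falls under a threshold. The function names, the 0.5-second threshold, and the 95% target are hypothetical choices for the example, not values prescribed by this glossary.

```python
def latency_sli(ttft_seconds, threshold_seconds):
    """Fraction of requests whose TTFT is at or below the threshold (good / total)."""
    if not ttft_seconds:
        raise ValueError("no requests observed")
    good = sum(1 for t in ttft_seconds if t <= threshold_seconds)
    return good / len(ttft_seconds)


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical sample: four requests, one of which is slow.
observed_ttft = [0.2, 0.3, 0.4, 1.5]
sli = latency_sli(observed_ttft, threshold_seconds=0.5)
print(f"SLI={sli:.2f}, SLO met: {meets_slo(sli, slo_target=0.95)}")
```

Expressing the SLI as good events over total events keeps it comparable across signals, so the same SLO check can be reused for error-based or quality-based indicators.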

SITE RELIABILITY ENGINEERING

The Four Golden Signals

The Four Golden Signals—latency, traffic, errors, and saturation—are the fundamental metrics defined by Google's Site Reliability Engineering (SRE) practice for comprehensively monitoring the health and performance of any service.

01

Latency

Latency measures the time it takes to service a request. For AI services, this is critical and often broken down into distinct phases:

  • Model Inference Latency: Total time from input submission to output generation.
  • Time To First Token (TTFT): For streaming LLM responses, the delay until the first token is emitted.
  • Time Per Output Token (TPOT): The average time to generate each subsequent token after the first; its inverse is the per-request decoding throughput.

Monitoring percentile latency (p50, p95, p99) is essential, because tail latency (p95, p99) often dictates user-perceived performance. Sustained high latency directly violates user-centric SLOs.
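
The latency breakdown above can be sketched from per-token timestamps. This is a minimal example, assuming the serving layer records when the request started and when each output token was emitted; the helper names `ttft_and_tpot` and `percentile` are hypothetical, and the percentile uses the simple nearest-rank method rather than any particular monitoring system's estimator.

```python
import math


def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds for one streamed response.

    token_times holds the emission time of each output token, in order.
    TPOT is None when only one token was produced, since it is undefined.
    """
    if not token_times:
        raise ValueError("no tokens emitted")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, None
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, tpot)
```

In practice the per-request values would be aggregated over a time window and reported as p50, p95, and p99, rather than as an average that hides the tail.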
02

Traffic

Traffic quantifies the demand placed on your service. For AI systems, this is more nuanced than simple request counts.

  • Queries Per Second (QPS): The raw volume of inference requests.
  • Concurrent Users/Sessions: Number of simultaneous active interactions.
  • Input/Output Volume: Size of prompts and generated completions, which impacts computational load.

Understanding traffic patterns is required for capacity planning, auto-scaling, and correlating load with other signals like latency and error rates.
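
A small sketch of how these traffic measures might be derived from a request log follows. It assumes each log entry carries an arrival timestamp and token counts; the `RequestRecord` type and `traffic_summary` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    timestamp: float        # arrival time in seconds
    prompt_tokens: int      # input size
    completion_tokens: int  # output size


def traffic_summary(records, window_start, window_seconds):
    """Summarize demand inside the half-open window [start, start + length)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window_end = window_start + window_seconds
    in_window = [r for r in records if window_start <= r.timestamp < window_end]
    total_tokens = sum(r.prompt_tokens + r.completion_tokens for r in in_window)
    return {
        "requests": len(in_window),
        "qps": len(in_window) / window_seconds,
        "tokens_per_second": total_tokens / window_seconds,
    }


log = [
    RequestRecord(0.0, 100, 50),
    RequestRecord(1.0, 100, 50),
    RequestRecord(2.0, 100, 50),
    RequestRecord(10.0, 100, 50),  # falls outside the window below
]
print(traffic_summary(log, window_start=0.0, window_seconds=4.0))
```

Tracking tokens per second alongside QPS matters because two workloads with the same request rate can place very different load on the model depending on prompt and completion length.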
03

Errors

Errors measure the rate of requests that fail. In AI services, failures extend beyond HTTP 5xx codes to include model-specific failures.

  • Service Errors: Failed API calls, timeouts, and infrastructure failures.
  • Model Errors: Structured output validation failures, context window overflows, or resource exhaustion.
  • Quality Errors: Outputs that fail quality checks, such as hallucinations or unfaithful answers in RAG systems, and therefore count against the corresponding quality SLO.

Tracking error rate against an error budget is fundamental to SRE's risk-management approach.
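
The three error categories and the error budget can be combined in one tally. The sketch below assumes each request has already been labeled with one of the hypothetical outcome strings `"ok"`, `"service"`, `"model"`, or `"quality"`; how a request gets labeled (for example, which evaluator decides that an answer is a hallucination) is outside the scope of the example.

```python
from collections import Counter


def error_summary(outcomes, allowed_error_rate):
    """Tally failures by category and compare them with the error budget.

    outcomes: one label per request, "ok" or an error category name.
    allowed_error_rate: the budget as a fraction, e.g. 0.01 for a 99% SLO.
    """
    if not outcomes:
        raise ValueError("no requests observed")
    counts = Counter(outcomes)
    total = len(outcomes)
    failed = total - counts["ok"]
    allowed_failures = allowed_error_rate * total
    return {
        "error_rate": failed / total,
        "by_category": {k: v for k, v in counts.items() if k != "ok"},
        # Positive means budget is left; negative means it is overspent.
        "budget_remaining": allowed_failures - failed,
    }


sample = ["ok"] * 995 + ["service"] * 2 + ["model"] + ["quality"] * 2
print(error_summary(sample, allowed_error_rate=0.01))
```

Keeping the per-category breakdown next to the overall rate makes it easier to see whether the budget is being spent on infrastructure failures or on output quality.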
04

Saturation

Saturation measures how "full" your service is, indicating resource exhaustion before errors or latency spikes occur. It's a measure of system utilization.

  • Hardware Metrics: GPU/CPU utilization, memory pressure, and I/O bandwidth.
  • Service-Specific Limits: Queue lengths, continuous batching efficiency in LLM servers, or token generation buffer saturation.
  • Derived Metrics: Scaling factor (demand/capacity).

Monitoring saturation provides the leading indicator needed for proactive scaling, preventing tail latency amplification and cascading failures.
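
The demand/capacity ratio mentioned above can be computed per resource, with the tightest resource reported as the service's saturation. This is a minimal sketch: the resource names, the sample values, and the 0.85 scaling threshold are all hypothetical.

```python
def saturation(used, capacity):
    """Demand divided by capacity; 1.0 means the resource is full."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def most_saturated(resources):
    """Return (name, ratio) for the resource closest to exhaustion.

    resources maps a resource name to a (used, capacity) pair.
    """
    if not resources:
        raise ValueError("no resources given")
    ratios = {name: saturation(u, c) for name, (u, c) in resources.items()}
    name = max(ratios, key=ratios.get)
    return name, ratios[name]


snapshot = {
    "gpu_memory_gb": (72, 80),
    "request_queue": (30, 100),
}
name, ratio = most_saturated(snapshot)
if ratio >= 0.85:  # hypothetical threshold for proactive scaling
    print(f"scale out: {name} at {ratio:.0%}")
```

Reporting the maximum rather than an average reflects the idea that a service is only as healthy as its most constrained resource.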
05

Applying Signals to AI Services

AI services require adapting the golden signals to model-specific behaviors. Key considerations include:

  • Defining SLIs/SLOs: An SLO for model inference latency or an SLO for hallucination rate translates golden signals into actionable reliability targets.
  • Agentic Systems: For autonomous agents, traffic may be tasks/hour, errors could be agent task success rate, and saturation might involve orchestration engine queue depth.
  • Observability Integration: These signals must feed into multi-window alerting based on burn rate to protect SLOs.
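
Multi-window burn-rate alerting can be sketched as follows. Burn rate is the observed error rate divided by the error rate the SLO allows; an alert fires only when both a long and a short window exceed the threshold, which filters out brief spikes and incidents that have already ended. The default threshold of 14.4 follows a convention described in Google's SRE Workbook for a one-hour window paired with a five-minute window, and is used here as an assumption rather than a recommendation for any particular service.

```python
def burn_rate(error_rate, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    return error_rate / (1.0 - slo_target)


def should_page(long_window_error_rate, short_window_error_rate,
                slo_target, threshold=14.4):
    """Fire only if both windows burn budget faster than the threshold."""
    return (
        burn_rate(long_window_error_rate, slo_target) >= threshold
        and burn_rate(short_window_error_rate, slo_target) >= threshold
    )


# Hypothetical readings for a 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))   # sustained and ongoing
print(should_page(0.02, 0.001, slo_target=0.999))  # already recovering
```

The same check applies to agentic systems if the error rate is replaced with the task failure rate over each window.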
06

Beyond the Basics: AI-Specific Extensions

While the four signals are sufficient for most services, AI systems often require supplemental signals for full observability:

  • Quality & Correctness: Metrics like Retrieval Precision@K for RAG or instruction following accuracy.
  • Data Health: Data drift detection to monitor input distribution shifts.
  • Business Impact: SLO for business metric correlation linking latency to user conversion.
  • Cost: SLO for cost efficiency (e.g., cost per inference) to balance performance with expenditure.

These extensions ensure monitoring captures both the operational and functional health of AI systems.
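
Two of these supplemental signals are simple enough to sketch directly. The example below computes Precision@K for a retrieval step and cost per inference; the function names, document IDs, and cost figures are hypothetical, and relevance labels are assumed to come from an existing evaluation set.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def cost_per_inference(total_cost, request_count):
    """Average spend per served request over some billing period."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    return total_cost / request_count


retrieved = ["a", "b", "c", "d"]   # ranked retrieval output
relevant = {"a", "c", "x"}         # labeled relevant documents
print(precision_at_k(retrieved, relevant, k=4))
print(cost_per_inference(total_cost=12.5, request_count=5000))
```

Note that dividing by k (rather than by the number of documents returned) penalizes a retriever that returns fewer than k results, which is one common convention among several.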
SLO/SLI DEFINITION FOR AI

Golden Signals for AI & ML Services

A core concept from Site Reliability Engineering (SRE) adapted for monitoring the health of artificial intelligence and machine learning services.

A golden signal is one of four fundamental metrics—latency, traffic, errors, and saturation—used to comprehensively monitor the health and performance of a service. Originating in Site Reliability Engineering (SRE), these signals provide a complete, high-level view of a system's behavior from a user's perspective, forming the empirical basis for defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

For AI services, these signals are specialized: latency includes Time To First Token (TTFT) and Time Per Output Token (TPOT); traffic measures queries per second (QPS); errors track failed inferences or hallucinations; and saturation monitors GPU memory and compute utilization. Monitoring these signals supports evaluation-driven development: teams can quantify performance, notice regressions that may stem from data drift, and maintain reliable model inference that meets business-critical SLOs.

MONITORING METRICS

Traditional vs. AI Service Golden Signals

This table contrasts the four canonical Golden Signals used in traditional SRE with their adapted counterparts for monitoring AI-powered services, highlighting the shift from infrastructure health to model quality and reasoning integrity.

| Golden Signal | Traditional Service (SRE Definition) | AI-Powered Service (Adapted Definition) | Primary Measurement Target |
| --- | --- | --- | --- |
| Latency | The time it takes to service a request. | The total time to produce a final, validated model output, including retrieval, inference, and any agentic reasoning steps. | End-to-end request duration (p95, p99) |
| Traffic | The demand/load placed on the service, measured in queries per second (QPS) or concurrent connections. | The rate and volume of inference requests, often segmented by model, endpoint, or user journey. Includes token throughput (tokens/sec). | Requests Per Second (RPS), Token Throughput |
| Errors | The rate of failed requests, typically HTTP 5xx server errors and, where relevant, 4xx client errors. | The rate of requests where the output fails quality or correctness checks, including hallucinations, safety violations, context overflows, and agentic execution failures. | Error Rate (Failed Requests / Total Requests) |
| Saturation | The utilization of a service's constrained resources (e.g., CPU, memory, I/O). | The utilization of constrained, scalable resources critical for AI performance, primarily GPU/accelerator memory and compute. Measures 'headroom' before quality degrades. | GPU Memory Utilization %, KV Cache Pressure |

GOLDEN SIGNAL

Frequently Asked Questions

A golden signal is one of four fundamental metrics—latency, traffic, errors, and saturation—used in site reliability engineering (SRE) to comprehensively monitor the health and performance of a service. This FAQ addresses its application in AI and machine learning systems.

A golden signal is one of four cardinal metrics—latency, traffic, errors, and saturation—collectively used to monitor the health and performance of a service, providing a comprehensive, high-level view of its operational state. Originating from Google's Site Reliability Engineering (SRE) practices, these signals are considered "golden" because they are sufficient to understand the user experience and system behavior without being overwhelmed by data. Latency measures the time to service a request. Traffic quantifies demand (e.g., requests per second). Errors track the rate of failed requests. Saturation indicates how "full" a resource is, like CPU or memory utilization. For AI services, these translate directly to metrics like model inference latency, queries per second (QPS), model error rates, and GPU memory saturation.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.