Inferensys

Golden Signals

Golden signals are the four key metrics—latency, traffic, errors, and saturation—used to monitor the health and performance of a distributed service, providing a high-level view of its operational state.
PRODUCTION CANARY ANALYSIS

What are Golden Signals?

Golden Signals are a foundational concept in Site Reliability Engineering (SRE) and MLOps for monitoring distributed systems. They consist of four primary metrics: latency (time to serve a request), traffic (demand on the system), errors (rate of failed requests), and saturation (utilization of system resources). These signals provide a comprehensive, high-level dashboard of a service's health, enabling engineers to quickly diagnose issues without being overwhelmed by data. In production canary analysis, these metrics are the primary indicators used to compare a new model deployment against a stable baseline.

The power of Golden Signals lies in their universality and sufficiency. By focusing on these four categories, teams can establish clear Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for AI-powered services. For instance, a canary deployment verdict in an Automated Canary Analysis (ACA) system is often based on statistically significant changes in these signals, such as increased error rates or latency percentiles. This framework directly supports Evaluation-Driven Development by providing the quantitative benchmarks needed to validate model performance and stability in live environments before a full rollout.

FOUNDATIONAL METRICS

The Four Golden Signals Explained

The four golden signals—latency, traffic, errors, and saturation—are the essential metrics for monitoring the health and performance of any distributed service, providing a comprehensive, high-level view of its operational state.

01

Latency

Latency measures the time required to service a request. It is the primary indicator of user-perceived performance.

  • Focus on tail latency: While average latency is useful, the 95th or 99th percentile (p95, p99) is critical for understanding worst-case user experience.
  • Distinguish success vs. failure: Track latency for successful requests separately from failed ones, as errors often return quickly, skewing the metric.
  • Example: An API with a p99 latency of 2 seconds means 99% of requests complete within 2 seconds; the remaining 1% are slower, potentially causing user frustration.
Critical percentiles: p95/p99
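
The latency guidance above can be sketched in a few lines. This is a minimal illustration, not a production metrics pipeline: the `(latency_seconds, succeeded)` record shape and the function names are assumptions made for the example, and the percentile uses the simple nearest-rank method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(requests):
    """requests: iterable of (latency_seconds, succeeded) pairs (hypothetical record shape)."""
    requests = list(requests)
    groups = {
        "success": [lat for lat, succeeded in requests if succeeded],
        "failure": [lat for lat, succeeded in requests if not succeeded],
    }
    summary = {}
    for label, sample in groups.items():
        if sample:  # skip a group that has no requests
            summary[label] = {
                "mean": sum(sample) / len(sample),
                "p95": percentile(sample, 95),
                "p99": percentile(sample, 99),
            }
    return summary

# 98 fast successes, 2 slow successes, and 5 failures that return almost instantly.
sample = [(0.1, True)] * 98 + [(2.0, True)] * 2 + [(0.01, False)] * 5
summary = latency_summary(sample)
```

In this sample the successful requests have a mean of about 0.138 s and a p95 of 0.1 s, while the p99 is 2.0 s: only the tail percentile exposes the slow requests. Mixing the fast failures into the same sample would pull the numbers down further, which is why the two groups are reported separately.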
02

Traffic

Traffic quantifies the demand placed on your service, typically measured as requests per second (RPS/QPS), network I/O, or concurrent sessions.

  • Service-specific metrics: For a web server, it's HTTP requests/sec. For a database, it could be transactions/sec. For a streaming service, it's network bytes/sec.
  • Correlates with other signals: A spike in traffic often correlates with increased latency and errors. Understanding baseline traffic patterns is essential for capacity planning.
  • Use for scaling: Traffic is the primary input for autoscaling policies, triggering the addition or removal of service instances to meet demand.
Common unit: RPS/QPS
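
As a rough sketch of how a requests-per-second figure might be derived, the class below keeps request timestamps in a sliding window. The class name, the window length, and the assumption that timestamps arrive in non-decreasing order are all choices made for this example; real services usually rely on counters exported to a metrics system instead.

```python
from collections import deque

class RequestRateMeter:
    """Tracks average requests per second over a sliding time window."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one request seen at `timestamp` (seconds, non-decreasing)."""
        self._timestamps.append(timestamp)

    def rate(self, now):
        """Average requests per second over the window ending at `now`."""
        cutoff = now - self.window_seconds
        # Drop requests that fell out of the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window_seconds

meter = RequestRateMeter(window_seconds=10.0)
for i in range(50):
    meter.record(i / 5)          # one request every 0.2 s, from t=0.0 to t=9.8

rps_now = meter.rate(now=10.0)    # all 50 requests are inside the window
rps_later = meter.rate(now=15.0)  # only requests from t=5.0 onward remain
```

Here `rps_now` is 5.0 and `rps_later` is 2.5, showing how the measured rate decays once traffic stops.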
03

Errors

Errors track the rate of requests that fail, either explicitly (HTTP 5xx, gRPC internal errors) or implicitly (HTTP 200 OK with wrong or degraded content).

  • Explicit vs. Implicit: Monitor both hard failures (e.g., 500 errors, timeouts) and soft failures (e.g., successful responses with invalid data, high latency that triggers client-side timeouts).
  • Error budget consumption: The error rate directly consumes your service's error budget (1 - SLO). A sustained high error rate signals an imminent breach of reliability commitments.
  • Golden signal for canaries: A rising error rate in a canary deployment compared to the baseline is often the fastest indicator of a problematic release.
Tied to reliability: SLO-driven
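
The error budget arithmetic (1 - SLO) can be made concrete with a small calculation. The function name and the example numbers below are illustrative assumptions; the sketch treats the SLO as a request-based success ratio over a single window.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize error-budget consumption for one SLO window.

    slo_target: fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    budget_fraction = 1.0 - slo_target                  # allowed failure fraction
    allowed_failures = budget_fraction * total_requests
    return {
        "error_rate": failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,  # 1.0 means fully spent
        "failures_remaining": allowed_failures - failed_requests,
    }

# A 99.9% SLO over one million requests allows about 1,000 failures.
report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
```

With 400 failures, roughly 40% of the budget is consumed and about 600 failures remain before the SLO is breached.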
04

Saturation

Saturation measures how "full" your service is, indicating the utilization of its most constrained resource (the bottleneck).

  • Resource-focused: This could be CPU utilization, memory consumption, disk I/O queue length, or network bandwidth. The key is identifying the limiting resource.
  • Proactive signal: Saturation often increases before latency degrades or errors spike. A service at or near 100% saturation has no headroom to absorb traffic spikes, which can lead to cascading failures.
  • Example Metrics: CPU usage >80%, memory swap rate, disk queue length, or network interface congestion. The saturation threshold is service-dependent.
Key focus: the bottleneck
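
One simple way to identify the limiting resource is to normalize each reading against its capacity and take the maximum. The resource names and readings below are hypothetical, and the sketch assumes usage and capacity are expressed in the same unit per resource.

```python
def most_saturated(usage, capacity):
    """Return (resource, utilization) for the resource closest to its limit."""
    utilization = {name: usage[name] / capacity[name] for name in usage}
    bottleneck = max(utilization, key=utilization.get)
    return bottleneck, utilization[bottleneck]

# Hypothetical readings; units only need to match between usage and capacity.
usage = {"cpu_cores": 6.0, "memory_gib": 28.0, "disk_iops": 1200.0}
capacity = {"cpu_cores": 8.0, "memory_gib": 32.0, "disk_iops": 5000.0}

resource, level = most_saturated(usage, capacity)
```

In this example memory is the bottleneck at 87.5% utilization, even though CPU at 75% might look like the more obvious thing to watch.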
05

Application in Canary Analysis

In Production Canary Analysis, the four golden signals are compared between the baseline (stable) deployment and the canary (new) deployment.

  • Automated Comparison: Tools like Kayenta or Flagger compare latency distributions, error rates, and traffic patterns between control and canary groups, using statistical tests or configured thresholds.
  • Deployment Verdict: A significant degradation in any golden signal (e.g., higher p99 latency, increased error percentage, or abnormal resource saturation) triggers an automated rollback.
  • Holistic Health View: Together, they provide a complete picture of whether the new model or service version performs as well as or better than the current one under real load.
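
To illustrate what a statistical comparison of one signal can look like, the sketch below applies a one-sided two-proportion z-test to baseline and canary error counts. This is a textbook test chosen for the example, not a description of how Kayenta or Flagger work internally; the function name, the counts, and the 0.05 significance level are assumptions.

```python
import math

def error_rate_regression(base_errors, base_total, canary_errors, canary_total, alpha=0.05):
    """One-sided two-proportion z-test: is the canary error rate higher than baseline?"""
    p_base = base_errors / base_total
    p_canary = canary_errors / canary_total
    pooled = (base_errors + canary_errors) / (base_total + canary_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / canary_total))
    if std_err == 0:
        # Both groups have identical all-success or all-failure outcomes.
        return {"z": 0.0, "p_value": 1.0, "regressed": False}
    z = (p_canary - p_base) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under the null hypothesis
    return {"z": z, "p_value": p_value, "regressed": p_value < alpha}

# Baseline: 50 errors in 10,000 requests (0.5%).
healthy = error_rate_regression(50, 10_000, 6, 1_000)    # canary at 0.6%
degraded = error_rate_regression(50, 10_000, 20, 1_000)  # canary at 2.0%
```

The 0.6% canary is not flagged because the difference is well within sampling noise, while the 2.0% canary is. The normal approximation behind this test becomes unreliable when error counts are very small, so low-traffic canaries need more data or a different test.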
06

Beyond Infrastructure: AI-Specific Signals

For AI/ML services, the golden signals framework expands to include model-specific quality metrics.

  • Latency: Model inference time, token generation speed.
  • Traffic: Predictions per second, token throughput.
  • Errors: Inference failures, hardware (GPU) errors.
  • Saturation: GPU memory utilization, accelerator compute load.
  • Augmented Signals: Must also monitor model performance drift (e.g., prediction score distribution shifts), hallucination rates for LLMs, and business metric impact (e.g., conversion rate in a recommendation canary).
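
A shift in the prediction score distribution can be quantified in several ways. The sketch below uses the Population Stability Index (PSI) with equal-width bins as one common option; the bin count, the epsilon floor for empty bins, and the assumption that scores lie in [0, 1] are choices made for this example.

```python
import math

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between two samples of scores in [0, 1], using equal-width bins."""
    def bin_fractions(scores):
        counts = [0] * bins
        for score in scores:
            index = min(int(score * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[index] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(count / len(scores), eps) for count in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))

baseline_scores = [i / 100 for i in range(100)]        # spread evenly over [0, 1)
shifted_scores = [0.5 + i / 200 for i in range(100)]   # concentrated in [0.5, 1)

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, shifted_scores)
```

Identical samples give a PSI of exactly zero, and the shifted sample gives a large value. Any alerting threshold on PSI is a rule of thumb that should be tuned per model rather than taken as fixed.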
MONITORING FOCUS

Golden Signals for AI vs. Traditional Services

A comparison of the four canonical Golden Signals—latency, traffic, errors, and saturation—as applied to traditional web services versus AI/ML-powered services, highlighting the shift in monitoring priorities and metric definitions.

| Signal | Traditional Service Monitoring | AI/ML Service Monitoring | Key Differences |
| --- | --- | --- | --- |
| Latency | Request/response time (p95, p99). Focus on network and service processing. | Time-to-first-token (TTFT) & inter-token latency. Dominated by model inference time and GPU/TPU queuing. | Shift from network-bound to compute-bound; critical to separate streaming token latency from total request time. |
| Traffic | Requests per second (RPS), query volume. Measures load on stateless endpoints. | Tokens per second (TPS), concurrent sessions. Must account for highly variable input/output lengths and context window usage. | Unit changes from discrete requests to continuous token streams; load is non-linear with respect to input size. |
| Errors | HTTP 4xx/5xx status codes, failed database transactions, timeouts. | Model-specific failures: hallucinations, policy violations, malformed JSON outputs, context window overflows, GPU out-of-memory (OOM) errors. | Errors are often semantic or functional (incorrect content) rather than protocol-level; requires content validation beyond HTTP codes. |
| Saturation | CPU utilization, memory usage, disk I/O, database connection pools. | GPU/TPU utilization, VRAM usage, KV cache memory pressure, batch queue depth. Bottleneck is accelerator memory/compute. | Primary resource constraints shift from general compute/IO to specialized hardware (GPU memory bandwidth, SRAM). |
| New Critical Signal: Quality | Not a core Golden Signal. Implied by error rate and business logs. | A primary signal. Measured via: correctness scores, hallucination rate, RAG precision/recall, instruction-following accuracy. | Must be monitored with the same rigor as errors; requires automated evaluation pipelines and can drift independently of system health. |
| New Critical Signal: Cost | Indirectly via infrastructure scaling. Roughly linear with traffic. | A primary, non-linear signal. Measured as cost per token, cost per session. Driven by model size, sequence length, and accelerator type. | Direct business metric; small changes in prompt design or user behavior can cause order-of-magnitude cost variance. |
| Alerting Thresholds | Based on static, historical baselines (e.g., latency > 200ms). | Must be dynamic and context-aware. Baseline varies by model version, input complexity, and accelerator load. Requires statistical drift detection. | Static thresholds fail; must use anomaly detection on metrics that have multi-modal distributions (e.g., latency for short vs. long prompts). |
| Root Cause Analysis | Tracing through service mesh, logs, and database queries. | Tracing through inference stack: prompt context, retrieved documents, model parameters, quantization level, and accelerator scheduler states. | Debugging requires visibility into the model's "reasoning" (e.g., attention patterns, retrieved context) and hardware scheduling. |
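
The distinction between time-to-first-token, inter-token latency, and total request time can be shown with a short calculation over token arrival timestamps. The function name and the timestamps are assumptions for illustration; a real serving stack would take these values from its own instrumentation.

```python
def streaming_latency(request_start, token_times):
    """Split a streamed response into TTFT, mean inter-token latency, and total time.

    request_start and token_times are timestamps in seconds on the same clock.
    """
    if not token_times:
        raise ValueError("no tokens were produced")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft": ttft,
        "mean_inter_token": mean_gap,
        "total": token_times[-1] - request_start,
    }

metrics = streaming_latency(request_start=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
```

Here the first token arrives after 0.5 s, later tokens arrive 0.25 s apart, and the full response takes 1.25 s. Tracking the three values separately shows whether a regression comes from queuing and prompt processing or from token generation itself.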

How Golden Signals Power Canary Analysis

Golden Signals provide the fundamental, high-level metrics required to perform automated, statistically rigorous canary analysis of new AI model deployments.

In canary analysis, the four golden signals (latency, traffic, errors, and saturation) are collected from both the stable baseline (control) and the new model version (canary) and compared using statistical tests. This comparison forms the objective basis for an automated deployment verdict, determining whether the canary performs within acceptable bounds before a full rollout.

For AI services, these signals are adapted: latency measures inference time, traffic tracks request volume, errors capture failed inferences or hallucinations, and saturation monitors resource utilization like GPU memory. By defining Service Level Objectives (SLOs) for these signals, teams establish clear, quantitative success criteria. Automated analysis tools like Kayenta then evaluate the canary against these criteria, enabling data-driven promotion or rollback decisions that minimize risk during model updates.
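
A promotion or rollback decision based on quantitative criteria might look like the sketch below. The signal names, the tolerance ratios, and the verdict labels are hypothetical and do not reflect the configuration format of Kayenta or any other tool. Traffic is left out because the traffic split between baseline and canary is normally set by the router.

```python
def canary_verdict(baseline, canary, criteria):
    """Return ("promote" or "rollback", reasons) from per-signal tolerance checks.

    criteria maps a signal name to the maximum allowed ratio canary / baseline.
    """
    reasons = []
    for signal, max_ratio in criteria.items():
        limit = baseline[signal] * max_ratio
        if canary[signal] > limit:
            reasons.append(f"{signal}: {canary[signal]} exceeds limit {limit:.4g}")
    return ("rollback" if reasons else "promote"), reasons

# Hypothetical tolerances: latency +10%, error rate +50%, GPU memory +15%.
criteria = {"p99_latency_s": 1.10, "error_rate": 1.50, "gpu_memory_util": 1.15}
baseline = {"p99_latency_s": 0.80, "error_rate": 0.004, "gpu_memory_util": 0.70}

bad_canary = {"p99_latency_s": 0.84, "error_rate": 0.009, "gpu_memory_util": 0.72}
good_canary = {"p99_latency_s": 0.82, "error_rate": 0.004, "gpu_memory_util": 0.71}

verdict_bad, reasons_bad = canary_verdict(baseline, bad_canary, criteria)
verdict_good, reasons_good = canary_verdict(baseline, good_canary, criteria)
```

The first canary is rolled back because its error rate more than doubles, even though latency and saturation stay within tolerance. Ratio-based limits become fragile when the baseline value is close to zero, so absolute limits derived from the SLO are a sensible complement.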

GOLDEN SIGNALS

Frequently Asked Questions

Golden signals are the four foundational metrics used to monitor the health of any distributed service or AI system. This FAQ addresses common questions about their definition, application, and role in modern MLOps and production canary analysis.

What are the four golden signals?

The four golden signals are latency, traffic, errors, and saturation. These metrics provide a high-level, comprehensive view of a service's operational health by measuring how fast it responds, how much demand it handles, how often it fails, and how fully its resources are utilized. Originating from Google's Site Reliability Engineering (SRE) practices, they are considered "golden" because they are universally applicable, easy to understand, and sufficient to identify most production issues without overwhelming engineers with data.

  • Latency: The time taken to service a request. It's critical to distinguish between the latency of successful requests and that of failed ones.
  • Traffic: A measure of demand on the system, often quantified as requests per second, network I/O, or concurrent sessions.
  • Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s) or implicitly (e.g., incorrect content from an AI model).
  • Saturation: How 'full' a service is, measuring the utilization of constrained resources like CPU, memory, I/O, or, for AI models, GPU VRAM or token capacity.
Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.