Golden Signals are a foundational concept in Site Reliability Engineering (SRE) and MLOps for monitoring distributed systems. They consist of four primary metrics: latency (time to serve a request), traffic (demand on the system), errors (rate of failed requests), and saturation (utilization of system resources). These signals provide a comprehensive, high-level dashboard of a service's health, enabling engineers to quickly diagnose issues without being overwhelmed by data. In production canary analysis, these metrics are the primary indicators used to compare a new model deployment against a stable baseline.

What are Golden Signals?
Golden Signals are the four key metrics—latency, traffic, errors, and saturation—used to monitor the health and performance of a distributed service, providing a high-level view of its operational state.
The power of Golden Signals lies in their universality and sufficiency. By focusing on these four categories, teams can establish clear Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for AI-powered services. For instance, a canary deployment verdict in an Automated Canary Analysis (ACA) system is often based on statistically significant changes in these signals, such as increased error rates or latency percentiles. This framework directly supports Evaluation-Driven Development by providing the quantitative benchmarks needed to validate model performance and stability in live environments before a full rollout.
The Four Golden Signals Explained
The four golden signals—latency, traffic, errors, and saturation—are the essential metrics for monitoring the health and performance of any distributed service, providing a comprehensive, high-level view of its operational state.
Latency
Latency measures the time required to service a request. It is the primary indicator of user-perceived performance.
- Focus on tail latency: While average latency is useful, the 95th or 99th percentile (p95, p99) is critical for understanding worst-case user experience.
- Distinguish success vs. failure: Track latency for successful requests separately from failed ones, as errors often return quickly, skewing the metric.
- Example: An API with a p99 latency of 2 seconds means 99% of requests complete within 2 seconds; the remaining 1% are slower, potentially causing user frustration.
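The two latency practices above (tail percentiles, and separating successes from failures) can be sketched in a few lines. This is a minimal illustration, not a monitoring library: the `(latency_seconds, succeeded)` record shape and the function names are assumptions made for the example, and the percentile uses the simple nearest-rank definition.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(requests):
    """requests: iterable of (latency_seconds, succeeded) pairs (hypothetical record shape)."""
    requests = list(requests)
    groups = {
        "success": [lat for lat, succeeded in requests if succeeded],
        "failure": [lat for lat, succeeded in requests if not succeeded],
    }
    summary = {}
    for name, samples in groups.items():
        if samples:
            summary[name] = {
                "mean": sum(samples) / len(samples),
                "p95": percentile(samples, 95),
                "p99": percentile(samples, 99),
            }
    return summary

# 97 fast successes, 3 slow successes, and 5 failures that return almost immediately.
sample = [(0.1, True)] * 97 + [(2.0, True)] * 3 + [(0.01, False)] * 5
summary = latency_summary(sample)
print(summary)
```

In this sample the mean and p95 of successful requests look healthy while the p99 exposes the slow tail, and the fast failures are kept out of the success numbers instead of pulling them down.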
Traffic
Traffic quantifies the demand placed on your service, typically measured as requests per second (RPS/QPS), network I/O, or concurrent sessions.
- Service-specific metrics: For a web server, it's HTTP requests/sec. For a database, it could be transactions/sec. For a streaming service, it's network bytes/sec.
- Correlates with other signals: A spike in traffic often correlates with increased latency and errors. Understanding baseline traffic patterns is essential for capacity planning.
- Use for scaling: Traffic is a common input for autoscaling policies, triggering the addition or removal of service instances to meet demand.
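As a rough sketch of how requests per second might be derived, the class below counts request timestamps inside a sliding window. The class name, the 60-second default, and passing timestamps in explicitly (rather than reading a clock) are all assumptions made to keep the example deterministic.

```python
from collections import deque

class RequestRateCounter:
    """Sliding-window request rate; timestamps are seconds on any monotonic scale."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Register one request that arrived at `timestamp`."""
        self.timestamps.append(timestamp)

    def rate(self, now):
        """Average requests per second over the window ending at `now`."""
        # Drop requests that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window

counter = RequestRateCounter(window_seconds=60.0)
for i in range(1, 121):           # one request every 0.5 s, from t=0.5 to t=60.0
    counter.record(i * 0.5)

rate_at_60 = counter.rate(now=60.0)
rate_at_90 = counter.rate(now=90.0)   # no new requests arrived after t=60
print(rate_at_60, rate_at_90)
```

An autoscaling policy could compare a value like this against a per-instance target, although production systems usually read the rate from a metrics backend instead of counting in process.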
Errors
Errors track the rate of requests that fail, either explicitly (HTTP 5xx, gRPC internal errors) or implicitly (HTTP 200 OK with wrong or degraded content).
- Explicit vs. Implicit: Monitor both hard failures (e.g., 500 errors, timeouts) and soft failures (e.g., successful responses with invalid data, high latency that triggers client-side timeouts).
- Error budget consumption: The error rate directly consumes your service's error budget (1 - SLO). A sustained high error rate signals an imminent breach of reliability commitments.
- Golden signal for canaries: A rising error rate in a canary deployment compared to the baseline is often the fastest indicator of a problematic release.
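The error budget arithmetic above can be made concrete with a small worked example. The function name and the returned fields are illustrative assumptions; `failed_requests` is assumed to already include both explicit and implicit failures.

```python
def error_budget_report(total_requests, failed_requests, slo_target):
    """Summarize how much of the error budget (1 - SLO) a measurement window has consumed."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_rate = failed_requests / total_requests
    budget = 1 - slo_target                      # allowed fraction of failed requests
    return {
        "error_rate": error_rate,
        "error_budget": budget,
        "budget_consumed": error_rate / budget,  # 1.0 means the budget is fully spent
    }

# 400 failures out of 1,000,000 requests against a 99.9% availability SLO.
report = error_budget_report(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(report)
```

Here the service has failed 0.04% of requests against an allowance of 0.1%, so roughly 40% of the budget for the window is gone.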
Saturation
Saturation measures how "full" your service is, indicating the utilization of its most constrained resource (the bottleneck).
- Resource-focused: This could be CPU utilization, memory consumption, disk I/O queue length, or network bandwidth. The key is identifying the limiting resource.
- Proactive signal: Saturation often increases before latency degrades or errors spike. A service at 100% saturation has no headroom to absorb traffic spikes, which can lead to cascading failures.
- Example Metrics: CPU usage >80%, memory swap rate, disk queue length, or network interface congestion. The saturation threshold is service-dependent.
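One simple way to express "the most constrained resource" is to normalize each resource's usage by its capacity and take the maximum. The sketch below does that; the resource names and numbers are hypothetical, and it assumes every resource has a meaningful positive capacity.

```python
def find_bottleneck(usage, capacity):
    """Return (resource_name, utilization) for the resource closest to its limit."""
    utilization = {name: usage[name] / capacity[name] for name in usage}
    name = max(utilization, key=utilization.get)
    return name, utilization[name]

# Hypothetical readings for one service instance.
usage = {"cpu_cores": 6.0, "memory_gib": 14.0, "db_connections": 40}
capacity = {"cpu_cores": 8.0, "memory_gib": 16.0, "db_connections": 100}

resource, saturation = find_bottleneck(usage, capacity)
print(resource, saturation)
```

In this example memory, not CPU, is the limiting resource, so memory is the value to report as the service's saturation.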
Application in Canary Analysis
In Production Canary Analysis, the four golden signals are compared between the baseline (stable) deployment and the canary (new) deployment.
- Automated Comparison: Tools like Kayenta or Flagger statistically analyze differences in latency distributions, error rates, and traffic patterns between control and canary groups.
- Deployment Verdict: A significant degradation in any golden signal (e.g., higher p99 latency, increased error percentage, or abnormal resource saturation) triggers an automated rollback.
- Holistic Health View: Together, they provide a complete picture of whether the new model or service version performs as well as or better than the current one under real load.
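To illustrate what a statistical comparison between control and canary can look like, the sketch below applies a one-sided two-proportion z-test to error counts. This is a deliberately simple stand-in chosen for the example, not the method used by Kayenta or Flagger, and the function name and sample counts are assumptions.

```python
import math

def error_rate_z_test(base_errors, base_total, canary_errors, canary_total):
    """One-sided two-proportion z-test: is the canary error rate higher than the baseline's?

    Returns (z, p_value). A small p_value suggests the canary is genuinely worse.
    """
    p_base = base_errors / base_total
    p_canary = canary_errors / canary_total
    pooled = (base_errors + canary_errors) / (base_total + canary_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / canary_total))
    if std_err == 0:
        return 0.0, 1.0   # both groups had zero (or only) errors; nothing to distinguish
    z = (p_canary - p_base) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper tail of the standard normal
    return z, p_value

# Baseline: 50 errors in 10,000 requests (0.5%). Canary: 30 errors in 1,000 requests (3%).
z_bad, p_bad = error_rate_z_test(50, 10_000, 30, 1_000)
# A canary with the same 0.5% error rate as the baseline.
z_same, p_same = error_rate_z_test(50, 10_000, 5, 1_000)
print(z_bad, p_bad, z_same, p_same)
```

The normal approximation behind this test is only reasonable when both groups have enough requests and errors; latency distributions generally need a different, distribution-aware comparison.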
Beyond Infrastructure: AI-Specific Signals
For AI/ML services, the golden signals framework expands to include model-specific quality metrics.
- Latency: Model inference time, token generation speed.
- Traffic: Predictions per second, token throughput.
- Errors: Inference failures, hardware (GPU) errors.
- Saturation: GPU memory utilization, accelerator compute load.
- Augmented Signals: Must also monitor model performance drift (e.g., prediction score distribution shifts), hallucination rates for LLMs, and business metric impact (e.g., conversion rate in a recommendation canary).
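A shift in the prediction score distribution can be quantified in several ways. The sketch below uses the Population Stability Index (PSI) as one common choice; it assumes scores lie in [0, 1], and the bin count, the smoothing constant `eps`, and the sample data are all illustrative.

```python
import math

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between two samples of scores assumed to lie in [0, 1]."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(scores):
        counts = [0] * bins
        for score in scores:
            index = min(max(int(score * bins), 0), bins - 1)  # clamp into a valid bin
            counts[index] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(count / len(scores), eps) for count in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))

baseline_scores = [i / 100 for i in range(100)]        # spread evenly over [0, 1)
canary_scores = [0.5 + i / 200 for i in range(100)]    # concentrated in [0.5, 1)

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, canary_scores)
print(psi_same, psi_shifted)
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but any threshold should be calibrated against the service's own history.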
Golden Signals for AI vs. Traditional Services
A comparison of the four canonical Golden Signals—latency, traffic, errors, and saturation—as applied to traditional web services versus AI/ML-powered services, highlighting the shift in monitoring priorities and metric definitions.
| Signal | Traditional Service Monitoring | AI/ML Service Monitoring | Key Differences |
|---|---|---|---|
| Latency | Request/response time (p95, p99). Focus on network and service processing. | Time-to-first-token (TTFT) & inter-token latency. Dominated by model inference time and GPU/TPU queuing. | Shift from network-bound to compute-bound; critical to separate streaming token latency from total request time. |
| Traffic | Requests per second (RPS), query volume. Measures load on stateless endpoints. | Tokens per second (TPS), concurrent sessions. Must account for highly variable input/output lengths and context window usage. | Unit changes from discrete requests to continuous token streams; load is non-linear with respect to input size. |
| Errors | HTTP 4xx/5xx status codes, failed database transactions, timeouts. | Model-specific failures: hallucinations, policy violations, malformed JSON outputs, context window overflows, GPU out-of-memory (OOM) errors. | Errors are often semantic or functional (incorrect content) rather than protocol-level; requires content validation beyond HTTP codes. |
| Saturation | CPU utilization, memory usage, disk I/O, database connection pools. | GPU/TPU utilization, VRAM usage, KV cache memory pressure, batch queue depth. Bottleneck is accelerator memory/compute. | Primary resource constraints shift from general compute/IO to specialized hardware (GPU memory bandwidth, SRAM). |
| New Critical Signal: Quality | Not a core Golden Signal. Implied by error rate and business logs. | A primary signal. Measured via: correctness scores, hallucination rate, RAG precision/recall, instruction-following accuracy. | Must be monitored with the same rigor as errors; requires automated evaluation pipelines and can drift independently of system health. |
| New Critical Signal: Cost | Indirectly via infrastructure scaling. Roughly linear with traffic. | A primary, non-linear signal. Measured as cost per token, cost per session. Driven by model size, sequence length, and accelerator type. | Direct business metric; small changes in prompt design or user behavior can cause order-of-magnitude cost variance. |
| Alerting Thresholds | Based on static, historical baselines (e.g., latency > 200ms). | Must be dynamic and context-aware. Baseline varies by model version, input complexity, and accelerator load. Requires statistical drift detection. | Static thresholds fail; must use anomaly detection on metrics that have multi-modal distributions (e.g., latency for short vs. long prompts). |
| Root Cause Analysis | Tracing through service mesh, logs, and database queries. | Tracing through inference stack: prompt context, retrieved documents, model parameters, quantization level, and accelerator scheduler states. | Debugging requires visibility into the model's "reasoning" (e.g., attention patterns, retrieved context) and hardware scheduling. |
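The streaming latency metrics named in the table (time-to-first-token and inter-token latency) can be derived from per-token arrival times. The sketch below assumes the client records a timestamp when the request is sent and one for each token received; the function name and the sample timestamps are hypothetical.

```python
def streaming_latency(request_sent, token_times):
    """Compute TTFT, mean inter-token latency, and throughput for one streamed response.

    request_sent: timestamp (seconds) when the request was sent.
    token_times: ascending timestamps (seconds) at which each token arrived.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_sent
    return {
        "ttft": ttft,
        "mean_inter_token": sum(gaps) / len(gaps) if gaps else 0.0,
        "total": total,
        "tokens_per_second": len(token_times) / total if total > 0 else float("inf"),
    }

# Request sent at t=10.0 s; five tokens arrive 0.25 s apart, starting at t=10.5 s.
metrics = streaming_latency(10.0, [10.5, 10.75, 11.0, 11.25, 11.5])
print(metrics)
```

Reporting TTFT and inter-token latency separately keeps a slow prefill from being hidden by fast decoding, or the reverse, which a single total request time would blur.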
How Golden Signals Power Canary Analysis
Golden Signals provide the fundamental, high-level metrics required to perform automated, statistically rigorous canary analysis of new AI model deployments.
Golden Signals are the four universal metrics—latency, traffic, errors, and saturation—that provide a comprehensive, high-level view of any distributed service's health. In canary analysis, these signals are collected from both the stable baseline (control) and the new model version (canary) and compared using statistical tests. This comparison forms the objective basis for an automated deployment verdict, determining if the canary performs within acceptable bounds before a full rollout.
For AI services, these signals are adapted: latency measures inference time, traffic tracks request volume, errors capture failed inferences or hallucinations, and saturation monitors resource utilization like GPU memory. By defining Service Level Objectives (SLOs) for these signals, teams establish clear, quantitative success criteria. Automated analysis tools like Kayenta then evaluate the canary against these criteria, enabling data-driven promotion or rollback decisions that minimize risk during model updates.
Frequently Asked Questions
Golden signals are the four foundational metrics used to monitor the health of any distributed service or AI system. This FAQ addresses common questions about their definition, application, and role in modern MLOps and production canary analysis.
What are the four golden signals?
The four golden signals are latency, traffic, errors, and saturation. These metrics provide a high-level, comprehensive view of a service's operational health by measuring how fast it responds, how much demand it handles, how often it fails, and how fully its resources are utilized. Originating from Google's Site Reliability Engineering (SRE) practices, they are considered 'golden' because they are universally applicable, easy to understand, and sufficient to identify most production issues without being overwhelmed by data.
- Latency: The time taken to service a request. It's critical to distinguish between the latency of successful requests and that of failed ones.
- Traffic: A measure of demand on the system, often quantified as requests per second, network I/O, or concurrent sessions.
- Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s) or implicitly (e.g., incorrect content from an AI model).
- Saturation: How 'full' a service is, measuring the utilization of constrained resources like CPU, memory, I/O, or, for AI models, GPU VRAM or token capacity.
Related Terms
Golden signals are a foundational concept within the broader practice of Production Canary Analysis. The following terms are essential for designing, executing, and evaluating controlled deployments.
Service Level Objective (SLO)
A target level of reliability or performance for a service, expressed as a measurable goal over a rolling time window. SLOs are the quantitative benchmarks against which golden signals are judged during a canary analysis.
- Example: "99.9% of requests shall have a latency under 200ms over a 30-day window."
- Relationship to Golden Signals: Defines the acceptable thresholds for latency (performance) and error rate (reliability).
- Error Budget: The allowable amount of unreliability (1 - SLO); a canary failure consumes this budget.
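The example SLO above can be checked by computing the matching SLI, the fraction of requests faster than the threshold, and comparing it with the target. This is a minimal sketch: the function names and the sample latencies are assumptions, and a real evaluation would run over the full 30-day window.

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """Fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)

def meets_slo(sli, target=0.999):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target

# 10,000 requests: exactly 10 of them (0.1%) are slower than 200 ms.
window_ok = [120] * 9_990 + [450] * 10
# Same volume, but 20 slow requests (0.2%).
window_bad = [120] * 9_980 + [450] * 20

sli_ok, sli_bad = latency_sli(window_ok), latency_sli(window_bad)
print(sli_ok, meets_slo(sli_ok), sli_bad, meets_slo(sli_bad))
```

The first window sits exactly on the 99.9% target and passes; the second has spent twice its allowance of slow requests and fails.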
Traffic Splitting
The controlled routing of a defined percentage of live user requests to different versions of a service. This is the mechanism that enables a canary deployment by directing a small slice of traffic (one of the golden signals) to the new model.
- Implementation: Often managed by a service mesh (e.g., Istio VirtualService) or a deployment controller.
- Progressive Rollout: Traffic percentage is gradually increased from 1% to 100% based on successful canary analysis.
- Key Consideration: Must ensure traffic splitting is statistically representative to avoid skewed metric comparisons.
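At the application level, a percentage-based split can be approximated by hashing a stable request key into buckets. The sketch below is one possible approach and not how Istio or any particular deployment controller implements it; the function name, the key format, and the 10,000-bucket resolution are assumptions.

```python
import hashlib

def route(request_key, canary_percent):
    """Deterministically assign a request key (e.g. a user ID) to 'canary' or 'baseline'."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999, i.e. 0.01% resolution
    return "canary" if bucket < canary_percent * 100 else "baseline"

# Route 10,000 hypothetical users with a 5% canary share.
assignments = [route(f"user-{i}", canary_percent=5) for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)
print(canary_share)
```

Hashing on a stable key keeps each user pinned to one version, which avoids mixing both versions within a session, and because buckets are assigned independently of user attributes the canary group is a roughly representative sample.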
Deployment Verdict
The final decision to promote a new candidate to full production or rollback to the previous stable version. This verdict is the primary output of an Automated Canary Analysis (ACA) process that evaluates the golden signals.
- Automation: The ideal state is a fully automated verdict, determined by whether the canary breaches its SLOs.
- Criteria: Based on statistical comparisons of canary vs. control metrics for errors, latency, and sometimes business KPIs.
- Rollback Trigger: A significant degradation in any golden signal typically triggers an automated rollback to limit blast radius.
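The verdict logic described above can be sketched as a function that compares each signal between control and canary and rolls back on any breach. For brevity this example uses fixed relative-increase limits instead of a statistical test, and the signal names and limits are hypothetical.

```python
def deployment_verdict(baseline, canary, max_relative_increase):
    """Return ('promote' | 'rollback', breached_signals).

    All signals are assumed to be 'lower is better'. A signal is breached when the
    canary exceeds the baseline by more than its allowed relative increase.
    """
    breaches = []
    for signal, limit in max_relative_increase.items():
        base, new = baseline[signal], canary[signal]
        if base > 0:
            increase = (new - base) / base
        else:
            increase = float("inf") if new > 0 else 0.0
        if increase > limit:
            breaches.append(signal)
    return ("rollback" if breaches else "promote"), breaches

baseline = {"p99_latency_ms": 200, "error_rate": 0.01, "gpu_memory_util": 0.50}
canary = {"p99_latency_ms": 210, "error_rate": 0.02, "gpu_memory_util": 0.55}
limits = {"p99_latency_ms": 0.10, "error_rate": 0.20, "gpu_memory_util": 0.15}

verdict, breached = deployment_verdict(baseline, canary, limits)
print(verdict, breached)
```

In this example latency and GPU memory stay within their limits, but the error rate has doubled, so the canary is rolled back.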
Blast Radius
The scope and potential impact of a failure introduced by a new deployment. A core goal of canary analysis is to minimize the blast radius by initially exposing the change to a very small subset of users or infrastructure.
- Containment Strategy: Limited initially by low traffic percentage in a canary.
- Golden Signals as Early Warning: A spike in errors or latency in the canary group signals a problem while the impact is contained.
- Escalation: If the canary is healthy, the blast radius is intentionally increased through a progressive rollout.
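The containment-then-escalation pattern above can be expressed as a loop over traffic steps that stops at the first unhealthy check. In this sketch the step percentages and the `is_healthy` callback are placeholders for a real canary analysis run at each stage.

```python
def progressive_rollout(steps, is_healthy):
    """Walk through traffic percentages, rolling back to 0% at the first unhealthy step.

    steps: ascending canary traffic percentages, e.g. [1, 5, 25, 50, 100].
    is_healthy: callable taking the current percentage, returning True if analysis passed.
    """
    reached = 0
    for percent in steps:
        # In a real system: shift `percent` of traffic to the canary, wait, then analyze.
        if not is_healthy(percent):
            return {"final_percent": 0, "failed_at": percent}
        reached = percent
    return {"final_percent": reached, "failed_at": None}

steps = [1, 5, 25, 50, 100]

# Hypothetical canary that only shows problems once it takes 25% of traffic.
failing = progressive_rollout(steps, is_healthy=lambda percent: percent < 25)
passing = progressive_rollout(steps, is_healthy=lambda percent: True)
print(failing, passing)
```

The failing run never exposes more than 25% of traffic to the bad version, which is the blast radius containment the rollout is designed to provide.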
Canary Metrics
The specific set of quantitative measurements collected and analyzed during a canary deployment. While the four golden signals provide a universal health check, canary metrics are often extended to include:
- Business KPIs: Conversion rates, transaction values, or user engagement scores.
- Domain-Specific Signals: For AI models, this includes prediction drift, hallucination rates, or confidence score distributions.
- Infrastructure Saturation: Memory, CPU, and GPU utilization beyond generic service saturation.
These metrics are visualized in a canary analysis dashboard for real-time decision-making.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.