Glossary

Golden Signals

Golden signals are the four key metrics—latency, traffic, errors, and saturation—used to monitor the health and performance of a distributed service, providing a high-level view of its operational state.


PRODUCTION CANARY ANALYSIS

What are Golden Signals?


Golden Signals are a foundational concept in Site Reliability Engineering (SRE) and MLOps for monitoring distributed systems. They consist of four primary metrics: latency (time to serve a request), traffic (demand on the system), errors (rate of failed requests), and saturation (utilization of system resources). These signals provide a comprehensive, high-level dashboard of a service's health, enabling engineers to quickly diagnose issues without being overwhelmed by data. In production canary analysis, these metrics are the primary indicators used to compare a new model deployment against a stable baseline.

The power of Golden Signals lies in their universality and sufficiency. By focusing on these four categories, teams can establish clear Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for AI-powered services. For instance, a canary deployment verdict in an Automated Canary Analysis (ACA) system is often based on statistically significant changes in these signals, such as increased error rates or latency percentiles. This framework directly supports Evaluation-Driven Development by providing the quantitative benchmarks needed to validate model performance and stability in live environments before a full rollout.

FOUNDATIONAL METRICS

The Four Golden Signals Explained

The four golden signals—latency, traffic, errors, and saturation—are the essential metrics for monitoring the health and performance of any distributed service, providing a comprehensive, high-level view of its operational state.

Latency

Latency measures the time required to service a request. It is the primary indicator of user-perceived performance.

Focus on tail latency: While average latency is useful, the 95th or 99th percentile (p95, p99) is critical for understanding worst-case user experience.
Distinguish success vs. failure: Track latency for successful requests separately from failed ones, as errors often return quickly, skewing the metric.
Example: An API with a p99 latency of 2 seconds means 99% of requests complete within 2 seconds; the remaining 1% are slower, potentially causing user frustration.
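
The percentile and success-versus-failure points above can be sketched as follows. This is a minimal illustration using the nearest-rank percentile definition; the sample data and the `percentile` helper are hypothetical, not taken from any particular monitoring tool.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical samples: (latency in ms, request succeeded?)
samples = [(120, True), (95, True), (110, True), (2300, True), (5, False),
           (130, True), (105, True), (4, False), (140, True), (100, True)]

# Track successful and failed requests separately, since fast failures
# would otherwise pull the overall latency figures down.
ok = [ms for ms, success in samples if success]
failed = [ms for ms, success in samples if not success]

print("p50 (success):", percentile(ok, 50))
print("p99 (success):", percentile(ok, 99))
print("p50 (failed): ", percentile(failed, 50))
```

With so few samples the p99 simply lands on the slowest request; in practice percentiles are only meaningful over a reasonably large window.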


Traffic

Traffic quantifies the demand placed on your service, typically measured as requests per second (RPS/QPS), network I/O, or concurrent sessions.

Service-specific metrics: For a web server, it's HTTP requests/sec. For a database, it could be transactions/sec. For a streaming service, it's network bytes/sec.
Correlates with other signals: A spike in traffic often correlates with increased latency and errors. Understanding baseline traffic patterns is essential for capacity planning.
Use for scaling: Traffic is a common input for autoscaling policies, which add or remove service instances to meet demand.
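
As a rough sketch of how traffic might be measured and fed into a scaling decision, the snippet below buckets request arrival times into one-second windows and derives a replica count from the peak rate. The arrival times and the per-replica capacity are assumed values for illustration only.

```python
import math
from collections import Counter

def requests_per_second(timestamps, window_s=1):
    """Bucket arrival times (in seconds) into fixed windows and return the rate per window.

    Windows that received no requests are omitted from the result.
    """
    buckets = Counter(int(t // window_s) for t in timestamps)
    return {b * window_s: count / window_s for b, count in sorted(buckets.items())}

# Hypothetical request arrival times, in seconds since the start of observation.
arrivals = [0.1, 0.4, 0.9, 1.2, 1.3, 1.5, 1.8, 2.7]
rates = requests_per_second(arrivals)
peak = max(rates.values())

# Assumed capacity of one replica (requests/sec); a real value comes from load testing.
CAPACITY_PER_REPLICA = 1.5
replicas = math.ceil(peak / CAPACITY_PER_REPLICA)

print(rates, peak, replicas)
```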


Errors

Errors track the rate of requests that fail, either explicitly (HTTP 5xx, gRPC internal errors) or implicitly (HTTP 200 OK with wrong or degraded content).

Explicit vs. Implicit: Monitor both hard failures (e.g., 500 errors, timeouts) and soft failures (e.g., successful responses with invalid data, high latency that triggers client-side timeouts).
Error budget consumption: The error rate directly consumes your service's error budget (1 - SLO). A sustained high error rate signals an imminent breach of reliability commitments.
Golden signal for canaries: A rising error rate in a canary deployment compared to the baseline is often the fastest indicator of a problematic release.
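
A minimal sketch of counting both explicit and implicit errors, and relating the resulting rate to an error budget, might look like this. The `Response` type, its `valid_body` flag, and the 99.9% SLO are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Response:
    status: int        # HTTP status code
    valid_body: bool   # outcome of a hypothetical content validation step

def is_error(r: Response) -> bool:
    # Explicit failure: any 5xx. Implicit failure: a 2xx whose content failed validation.
    return r.status >= 500 or (200 <= r.status < 300 and not r.valid_body)

def error_rate(responses) -> float:
    return sum(is_error(r) for r in responses) / len(responses)

# 996 good responses, 3 hard failures, 1 "200 OK" with invalid content.
responses = [Response(200, True)] * 996 + [Response(500, False)] * 3 + [Response(200, False)]

rate = error_rate(responses)   # 4 / 1000
slo = 0.999                    # assumed availability target
budget = 1 - slo               # allowed failure fraction
burn = rate / budget           # above 1 means the budget is being spent faster than allowed

print(f"error rate={rate:.4f}, burn rate={burn:.1f}x")
```

Note that counting only status codes here would report 0.3% instead of 0.4%, which is why the implicit failure matters.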


Saturation

Saturation measures how "full" your service is, indicating the utilization of its most constrained resource (the bottleneck).

Resource-focused: This could be CPU utilization, memory consumption, disk I/O queue length, or network bandwidth. The key is identifying the limiting resource.
Proactive signal: Saturation often rises before latency degrades or errors spike. A service at 100% saturation has no headroom for traffic spikes, which can lead to cascading failures.
Example Metrics: CPU usage >80%, memory swap rate, disk queue length, or network interface congestion. The saturation threshold is service-dependent.
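
Identifying the limiting resource can be as simple as taking the highest utilization ratio across resources. The sketch below assumes each resource is already expressed as a fraction of its capacity; the numbers and the 80% warning level are illustrative.

```python
def bottleneck(utilization):
    """Return the (resource, utilization) pair closest to its limit."""
    return max(utilization.items(), key=lambda kv: kv[1])

# Hypothetical utilization, each expressed as used / capacity.
usage = {"cpu": 0.62, "memory": 0.91, "disk_io": 0.40, "network": 0.35}

WARN_AT = 0.80  # assumed threshold; the right value is service-dependent
resource, level = bottleneck(usage)
saturated = level >= WARN_AT

print(resource, level, saturated)
```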


Application in Canary Analysis

In Production Canary Analysis, the four golden signals are compared between the baseline (stable) deployment and the canary (new) deployment.

Automated Comparison: Tools like Kayenta or Flagger compare latency distributions, error rates, and traffic patterns between the control and canary groups, using statistical tests or configured metric thresholds.
Deployment Verdict: A significant degradation in any golden signal (e.g., higher p99 latency, increased error percentage, or abnormal resource saturation) triggers an automated rollback.
Holistic Health View: Together, they provide a complete picture of whether the new model or service version performs as well as or better than the current one under real load.
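
To make the idea of a statistical comparison concrete, here is one possible test for the error signal: a one-sided two-proportion z-test asking whether the canary's error rate is significantly higher than the baseline's. This is a generic textbook test chosen for illustration, not necessarily what Kayenta or Flagger run internally, and the request counts and the 0.01 significance level are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def error_rate_z_test(base_err, base_n, canary_err, canary_n):
    """One-sided two-proportion z-test: is the canary error rate higher than baseline?"""
    p_base, p_canary = base_err / base_n, canary_err / canary_n
    pooled = (base_err + canary_err) / (base_n + canary_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / canary_n))
    if se == 0:  # no errors at all (or only errors): nothing to distinguish
        return 0.0, 1.0
    z = (p_canary - p_base) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical counts: baseline 50 errors in 10,000 requests, canary 15 in 1,000.
z, p_value = error_rate_z_test(50, 10_000, 15, 1_000)
ALPHA = 0.01  # assumed significance level
decision = "rollback" if p_value < ALPHA else "continue"

print(f"z={z:.2f}, p={p_value:.5f}, decision={decision}")
```

The test relies on a normal approximation, so it is only reasonable when both groups have enough requests and errors; very small canaries need a different method or a longer observation window.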

Beyond Infrastructure: AI-Specific Signals

For AI/ML services, the golden signals framework expands to include model-specific quality metrics.

Latency: Model inference time, token generation speed.
Traffic: Predictions per second, token throughput.
Errors: Inference failures, hardware (GPU) errors.
Saturation: GPU memory utilization, accelerator compute load.
Augmented Signals: Teams should also monitor model performance drift (e.g., shifts in the prediction score distribution), hallucination rates for LLMs, and business metric impact (e.g., conversion rate in a recommendation canary).

MONITORING FOCUS

Golden Signals for AI vs. Traditional Services

A comparison of the four canonical Golden Signals—latency, traffic, errors, and saturation—as applied to traditional web services versus AI/ML-powered services, highlighting the shift in monitoring priorities and metric definitions.

Signal	Traditional Service Monitoring	AI/ML Service Monitoring	Key Differences
Latency	Request/response time (p95, p99). Focus on network and service processing.	Time-to-first-token (TTFT) & inter-token latency. Dominated by model inference time and GPU/TPU queuing.	Shift from network-bound to compute-bound; critical to separate streaming token latency from total request time.
Traffic	Requests per second (RPS), query volume. Measures load on stateless endpoints.	Tokens per second (TPS), concurrent sessions. Must account for highly variable input/output lengths and context window usage.	Unit changes from discrete requests to continuous token streams; load is non-linear with respect to input size.
Errors	HTTP 4xx/5xx status codes, failed database transactions, timeouts.	Model-specific failures: hallucinations, policy violations, malformed JSON outputs, context window overflows, GPU out-of-memory (OOM) errors.	Errors are often semantic or functional (incorrect content) rather than protocol-level; requires content validation beyond HTTP codes.
Saturation	CPU utilization, memory usage, disk I/O, database connection pools.	GPU/TPU utilization, VRAM usage, KV cache memory pressure, batch queue depth. Bottleneck is accelerator memory/compute.	Primary resource constraints shift from general compute/IO to specialized hardware (GPU memory bandwidth, SRAM).
New Critical Signal: Quality	Not a core Golden Signal. Implied by error rate and business logs.	A primary signal. Measured via: correctness scores, hallucination rate, RAG precision/recall, instruction-following accuracy.	Must be monitored with the same rigor as errors; requires automated evaluation pipelines and can drift independently of system health.
New Critical Signal: Cost	Indirectly via infrastructure scaling. Roughly linear with traffic.	A primary, non-linear signal. Measured as cost per token, cost per session. Driven by model size, sequence length, and accelerator type.	Direct business metric; small changes in prompt design or user behavior can cause order-of-magnitude cost variance.
Alerting Thresholds	Based on static, historical baselines (e.g., latency > 200ms).	Must be dynamic and context-aware. Baseline varies by model version, input complexity, and accelerator load. Requires statistical drift detection.	Static thresholds fail; must use anomaly detection on metrics that have multi-modal distributions (e.g., latency for short vs. long prompts).
Root Cause Analysis	Tracing through service mesh, logs, and database queries.	Tracing through inference stack: prompt context, retrieved documents, model parameters, quantization level, and accelerator scheduler states.	Debugging requires visibility into the model's "reasoning" (e.g., attention patterns, retrieved context) and hardware scheduling.
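
The streaming latency and traffic metrics named in the table (time-to-first-token, inter-token latency, tokens per second) can be derived from per-token timestamps. The sketch below assumes timestamps in milliseconds for a single streamed response; the function name and sample values are hypothetical.

```python
def streaming_metrics(request_ms, token_ms):
    """Return (TTFT in ms, mean inter-token latency in ms, tokens per second) for one response."""
    if not token_ms:
        raise ValueError("response produced no tokens")
    ttft = token_ms[0] - request_ms
    gaps = [later - earlier for earlier, later in zip(token_ms, token_ms[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    total_s = (token_ms[-1] - request_ms) / 1000
    tokens_per_s = len(token_ms) / total_s if total_s > 0 else 0.0
    return ttft, mean_itl, tokens_per_s

# Request sent at t=0 ms; five tokens arrive between 500 ms and 700 ms.
ttft, mean_itl, tps = streaming_metrics(0, [500, 550, 600, 650, 700])
print(ttft, mean_itl, round(tps, 2))
```

Here the first token dominates total request time, which is why TTFT and inter-token latency are tracked separately.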

PRODUCTION CANARY ANALYSIS

How Golden Signals Power Canary Analysis

Golden Signals provide the fundamental, high-level metrics required to perform automated, statistically rigorous canary analysis of new AI model deployments.

Golden Signals are the four universal metrics—latency, traffic, errors, and saturation—that provide a comprehensive, high-level view of any distributed service's health. In canary analysis, these signals are collected from both the stable baseline (control) and the new model version (canary) and compared using statistical tests. This comparison forms the objective basis for an automated deployment verdict, determining if the canary performs within acceptable bounds before a full rollout.

For AI services, these signals are adapted: latency measures inference time, traffic tracks request volume, errors capture failed inferences or hallucinations, and saturation monitors resource utilization like GPU memory. By defining Service Level Objectives (SLOs) for these signals, teams establish clear, quantitative success criteria. Automated analysis tools like Kayenta then evaluate the canary against these criteria, enabling data-driven promotion or rollback decisions that minimize risk during model updates.
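
A simplified version of such a verdict, using fixed tolerances instead of statistical tests, might look like the following. The metric names, tolerance multipliers, and sample values are all assumptions for illustration; they do not reflect the configuration format of Kayenta or any other tool.

```python
# Hypothetical tolerances: how much worse the canary may be, relative to baseline.
TOLERANCE = {"p99_latency_ms": 1.10, "error_rate": 1.20, "gpu_mem_util": 1.15}

def verdict(baseline, canary, tolerance=TOLERANCE):
    """Return ('promote', []) or ('rollback', [names of signals that degraded too much])."""
    failures = [name for name, limit in tolerance.items()
                if canary[name] > baseline[name] * limit]
    return ("rollback", failures) if failures else ("promote", [])

baseline = {"p99_latency_ms": 800, "error_rate": 0.010, "gpu_mem_util": 0.70}
canary = {"p99_latency_ms": 850, "error_rate": 0.015, "gpu_mem_util": 0.72}

print(verdict(baseline, canary))
```

Relative tolerances break down when a baseline value is zero (any nonzero canary value fails), so a real check would usually combine them with absolute floors.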

GOLDEN SIGNALS

Frequently Asked Questions

Golden signals are the four foundational metrics used to monitor the health of any distributed service or AI system. This FAQ addresses common questions about their definition, application, and role in modern MLOps and production canary analysis.

The four golden signals are latency, traffic, errors, and saturation. These metrics provide a high-level, comprehensive view of a service's operational health by measuring how fast it responds, how much demand it handles, how often it fails, and how fully its resources are utilized. Originating from Google's Site Reliability Engineering (SRE) practices, they are considered 'golden' because they are universally applicable, easy to understand, and sufficient to identify most production issues without being overwhelmed by data.

Latency: The time taken to service a request. It's critical to distinguish between the latency of successful requests and that of failed ones.
Traffic: A measure of demand on the system, often quantified as requests per second, network I/O, or concurrent sessions.
Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s) or implicitly (e.g., incorrect content from an AI model).
Saturation: How 'full' a service is, measuring the utilization of constrained resources like CPU, memory, I/O, or, for AI models, GPU VRAM or token capacity.


PRODUCTION CANARY ANALYSIS

Related Terms

Golden signals are a foundational concept within the broader practice of Production Canary Analysis. The following terms are essential for designing, executing, and evaluating controlled deployments.

Automated Canary Analysis (ACA)

The automated process of comparing key metrics, including the golden signals, between a baseline (control) deployment and a new candidate (canary) deployment. It uses statistical analysis to generate a deployment verdict (promote or roll back) without manual intervention.

Core Function: Continuously evaluates latency, traffic, errors, and saturation during a release.
Tools: Implemented by platforms like Kayenta, Argo Rollouts, and Flagger.
Output: Provides a pass/fail signal based on predefined Service Level Objectives (SLOs).


Service Level Objective (SLO)

A target level of reliability or performance for a service, expressed as a measurable goal over a rolling time window. SLOs are the quantitative benchmarks against which golden signals are judged during a canary analysis.

Example: "99.9% of requests shall have a latency under 200ms over a 30-day window."
Relationship to Golden Signals: Defines the acceptable thresholds for latency (performance) and error rate (reliability).
Error Budget: The allowable amount of unreliability (1 - SLO); a canary failure consumes this budget.
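
The example SLO above can be checked against a window of request latencies as sketched below. The sample data is made up, and `Fraction` is used for the target only to keep the allowed-failure count exact.

```python
from fractions import Fraction
from math import floor

def slo_report(latencies_ms, threshold_ms=200, target=Fraction(999, 1000)):
    """Evaluate 'target fraction of requests complete in under threshold_ms'."""
    total = len(latencies_ms)
    bad = sum(1 for ms in latencies_ms if ms >= threshold_ms)
    allowed_bad = floor(total * (1 - target))  # error budget, in requests
    return {
        "compliance": (total - bad) / total,
        "allowed_bad": allowed_bad,
        "budget_remaining": allowed_bad - bad,
        "slo_met": bad <= allowed_bad,
    }

# Hypothetical window: 10,000 requests, 7 of them slower than 200 ms.
latencies = [120] * 9993 + [450] * 7
report = slo_report(latencies)
print(report)
```

In this window the budget allows 10 slow requests, 7 have been used, and 3 remain before the SLO is breached.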

Traffic Splitting

The controlled routing of a defined percentage of live user requests to different versions of a service. This is the mechanism that enables a canary deployment by directing a small slice of traffic (one of the golden signals) to the new model.

Implementation: Often managed by a service mesh (e.g., Istio VirtualService) or a deployment controller.
Progressive Rollout: Traffic percentage is gradually increased from 1% to 100% based on successful canary analysis.
Key Consideration: The traffic split must be statistically representative of overall traffic; otherwise, metric comparisons between canary and baseline will be skewed.
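
One way to get a stable, roughly uniform split is to hash a request attribute into buckets. The sketch below is a standalone illustration of that idea, not how Istio or any specific controller implements weighted routing; the `route` function and the user IDs are hypothetical.

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Deterministically assign a user to 'canary' or 'baseline' by hashing the ID."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "baseline"

users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 10) == "canary" for u in users) / len(users)
print(f"canary share at 10%: {share:.3f}")
```

Because the bucket depends only on the user ID, a user who lands in the canary at 5% stays in it when the percentage is raised, which keeps each user's experience consistent during a progressive rollout.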

Deployment Verdict

The final decision to promote a new candidate to full production or roll back to the previous stable version. This verdict is the primary output of an Automated Canary Analysis (ACA) process that evaluates the golden signals.

Automation: The ideal state is a fully automated verdict, decided by whether the canary breaches its SLOs.
Criteria: Based on statistical comparisons of canary vs. control metrics for errors, latency, and sometimes business KPIs.
Rollback Trigger: A significant degradation in any golden signal typically triggers an automated rollback to limit blast radius.

Blast Radius

The scope and potential impact of a failure introduced by a new deployment. A core goal of canary analysis is to minimize the blast radius by initially exposing the change to a very small subset of users or infrastructure.

Containment Strategy: Limited initially by low traffic percentage in a canary.
Golden Signals as Early Warning: A spike in errors or latency in the canary group signals a problem while the impact is contained.
Escalation: If the canary is healthy, the blast radius is intentionally increased through a progressive rollout.
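
The containment and escalation steps above amount to a loop that widens exposure only while the canary stays healthy. In this sketch the step percentages and the `is_healthy` callback are placeholders; in practice the callback would wrap a golden-signal comparison at each step.

```python
def progressive_rollout(steps, is_healthy):
    """Increase canary traffic step by step; stop at the first unhealthy step."""
    for pct in steps:
        if not is_healthy(pct):
            return ("rollback", pct)  # blast radius was limited to pct% of traffic
    return ("promote", steps[-1])

STEPS = [1, 5, 25, 50, 100]  # assumed rollout schedule, in percent of traffic

# Hypothetical health check: the canary starts failing once it takes 50% of traffic.
result = progressive_rollout(STEPS, lambda pct: pct < 50)
print(result)
```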

Canary Metrics

The specific set of quantitative measurements collected and analyzed during a canary deployment. While the four golden signals provide a universal health check, canary metrics are often extended to include:

Business KPIs: Conversion rates, transaction values, or user engagement scores.
Domain-Specific Signals: For AI models, this includes prediction drift, hallucination rates, or confidence score distributions.
Infrastructure Saturation: Memory, CPU, and GPU utilization beyond generic service saturation.

These metrics are visualized in a canary analysis dashboard for real-time decision-making.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
