Glossary

Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a quantitative measure of a specific aspect of a service's performance, such as request latency, error rate, or throughput, used to calculate compliance with a Service Level Objective (SLO).


PRODUCTION CANARY ANALYSIS

What is a Service Level Indicator (SLI)?

An SLI is a direct, measurable signal of a service's health from the user's perspective. Common examples include the proportion of successful HTTP requests (availability), the time taken to serve a request (latency), or the rate of valid outputs from an AI model (quality). In production canary analysis, SLIs are the primary metrics compared between the stable control group and the new canary deployment to detect performance regressions before a full rollout.

Defining precise SLIs is foundational to Evaluation-Driven Development. For AI services, SLIs extend beyond infrastructure to include model-specific metrics like inference latency, prediction accuracy, or hallucination rate. These indicators feed into Automated Canary Analysis (ACA) systems, which statistically evaluate SLI differences to generate a deployment verdict, ensuring releases meet predefined Service Level Objectives (SLOs) without degrading the user experience.

DEFINITION

Key Characteristics of an SLI

Quantitative and Measurable

An SLI must be a quantifiable metric derived from observable system data, not a subjective opinion. It is calculated from raw telemetry like request counts, error logs, or latency measurements. Examples include:

Request latency: The time taken to successfully process a request, usually summarized as a percentile (e.g., 95th percentile latency, which an SLO might require to stay under 200ms).
Error rate: The proportion of requests that result in a failure (e.g., (failed requests / total requests) * 100).
Throughput: The number of requests a system handles per second.
Availability: The proportion of time a service is operational and responding.
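As a rough illustration of how these metrics fall out of raw telemetry, the sketch below computes error rate, throughput, and p95 latency from a list of request records. The `Request` record, its field names, and the sample data are assumptions made for this example; treating any status of 500 or above as a failure and using the nearest-rank percentile method are also choices of the sketch, not requirements.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical telemetry record for one request."""
    timestamp_s: float  # seconds since the start of the window
    latency_ms: float
    status: int         # HTTP status code


def error_rate_percent(requests):
    """(failed requests / total requests) * 100, counting 5xx as failures."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests) * 100


def throughput_rps(requests, window_s):
    """Requests handled per second over a window of window_s seconds."""
    return len(requests) / window_s


def p95_latency_ms(requests):
    """Nearest-rank 95th percentile latency of non-failed requests."""
    ok = sorted(r.latency_ms for r in requests if r.status < 500)
    if not ok:
        return None
    rank = math.ceil(0.95 * len(ok))  # 1-based nearest rank
    return ok[rank - 1]


# Made-up sample: 20 requests over 10 seconds, two of them failing.
sample = [
    Request(timestamp_s=i * 0.5, latency_ms=10.0 * i,
            status=500 if i in (5, 15) else 200)
    for i in range(1, 21)
]
sample_error_rate = error_rate_percent(sample)
sample_throughput = throughput_rps(sample, window_s=10)
sample_p95 = p95_latency_ms(sample)
```

Whether failed requests should count toward the latency percentile is a definitional choice; this sketch excludes them so that fast failures do not flatter the latency figure.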

Directly Tied to User Experience

Effective SLIs measure aspects of the service that end-users directly perceive as quality. They should answer the question: "What does a good experience look like for our users?"

User-facing latency is a better SLI than internal CPU utilization.
HTTP 5xx error rate is more relevant than low-level disk I/O errors, unless those errors cause user-visible failures.
The definition of a 'successful' request must align with the user's goal (e.g., a search returning relevant results, not just a 200 OK).

Defined Over a Specific Aggregation

An SLI is not a single data point but a statistical aggregation over a defined time window and request population. Aggregating smooths out the noise that would otherwise trigger unnecessary alerts.

Time Window: SLIs are evaluated over periods like 1 minute, 5 minutes, or 28 days (rolling).
Aggregation Method: Common methods include:
- Ratio: (Good events / Total eligible events) over the window.
- Distribution: Percentiles (p50, p95, p99) of a measurement like latency.
- Threshold: Percentage of time a metric is below/above a target.
Example: "The proportion of HTTP requests that succeeded over the last 5 minutes."
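The ratio and threshold methods can be sketched in a few lines. This is a minimal example, assuming events arrive as hypothetical `(timestamp_s, is_good)` pairs and that "at or below the target" counts as meeting a threshold; neither convention comes from a specific monitoring tool.

```python
def ratio_sli(events, now_s, window_s):
    """Good events / total eligible events within [now_s - window_s, now_s].

    events is an iterable of (timestamp_s, is_good) pairs.
    Returns None when the window holds no events.
    """
    in_window = [good for ts, good in events
                 if now_s - window_s <= ts <= now_s]
    if not in_window:
        return None
    return sum(1 for good in in_window if good) / len(in_window)


def threshold_sli(samples, target):
    """Fraction of samples that are at or below the target value."""
    if not samples:
        return None
    return sum(1 for s in samples if s <= target) / len(samples)


# Made-up data: six requests, evaluated over the last 5 minutes (300 s).
events = [(0, True), (100, False), (200, True),
          (400, True), (500, False), (600, True)]
success_ratio = ratio_sli(events, now_s=600, window_s=300)

# Made-up per-minute latency readings against a 200 ms target.
within_target = threshold_sli([120, 180, 250, 90], target=200)
```

Returning `None` for an empty window keeps "no traffic" distinct from "0% good", which matters for low-volume services.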

Aligned with a Service Level Objective (SLO)

An SLI is meaningless without a target threshold defined in an SLO. The SLO sets the acceptable performance level for the SLI.

SLI: The measurement itself (e.g., error rate calculated as 0.5%).
SLO: The target for that measurement (e.g., error rate ≤ 0.1%).
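The error budget is then derived from the SLO: it is the amount of unreliability the SLO permits (e.g., with a target of error rate ≤ 0.1%, up to 0.1% of requests can fail before the budget is exhausted). A measured error rate of 0.5% is five times that allowance, so the budget is being overspent. This creates a clear, data-driven framework for deciding when to halt deployments or prioritize reliability work.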
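To make the arithmetic concrete, the sketch below compares an observed error rate of 0.5% with a target of 0.1%. The function name `budget_burn_rate` is an assumption of this example; the ratio it returns is what is commonly called a burn rate, where 1.0 means errors arrive exactly as fast as the SLO allows.

```python
def budget_burn_rate(observed_error_rate, slo_error_rate):
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value above 1.0 means the error budget is being spent faster
    than the SLO permits.
    """
    if slo_error_rate <= 0:
        raise ValueError("SLO error rate must be positive")
    return observed_error_rate / slo_error_rate


observed = 0.005  # SLI: 0.5% of requests failed
target = 0.001    # SLO: at most 0.1% of requests may fail

burn_rate = budget_burn_rate(observed, target)
within_slo = observed <= target
```

With these numbers the service is consuming its budget about five times faster than allowed, which is the kind of signal that would justify pausing deployments.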

Implementation via Reliable Telemetry

SLIs must be computed from high-fidelity, production-grade observability data. The measurement system must be more reliable than the service it monitors.

Data Sources: Application logs, structured metrics from exporters (Prometheus), distributed traces (OpenTelemetry), or load balancer access logs.
Instrumentation Points: SLIs should be measured as close to the user as possible, often at the service entry point (e.g., API gateway, load balancer).
Avoiding Bias: The measurement must cover all relevant traffic. Sampling can introduce bias and invalidate SLI calculations for low-volume services.

SLI Examples in AI/ML Services

For AI-powered services, SLIs must capture both infrastructure health and model quality. Key examples include:

Inference Latency: p95 latency for model prediction requests.
Model Throughput: Predictions per second the endpoint can handle.
Inference Error Rate: Percentage of prediction requests returning a 5xx error or a system-level failure.
Model Quality Drift: Percentage of predictions where confidence scores diverge significantly from a baseline, indicating potential performance degradation.
Hallucination Rate (for LLMs): Proportion of generated outputs flagged as factually incorrect or unsupported by the provided context.
Data Freshness: Age of the most recent data used for a prediction in a real-time system.
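Two of the model-specific indicators above reduce to simple calculations once the inputs exist. The sketch below assumes a validator has already labeled each output with a boolean flag and that timestamps are timezone-aware; the function names and sample values are made up for illustration.

```python
from datetime import datetime, timezone


def hallucination_rate(flags):
    """Proportion of outputs flagged as incorrect or unsupported.

    flags is a sequence of booleans, one per generated output,
    produced by some upstream validator (assumed, not shown).
    """
    if not flags:
        return None
    return sum(1 for flagged in flags if flagged) / len(flags)


def data_freshness_seconds(latest_data_time, prediction_time):
    """Age, in seconds, of the newest data used for a prediction."""
    return (prediction_time - latest_data_time).total_seconds()


# Made-up inputs for illustration.
rate = hallucination_rate([False, True, False, False])
age_s = data_freshness_seconds(
    datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 12, 0, 45, tzinfo=timezone.utc),
)
```

The hard part in practice is producing trustworthy flags, not computing the ratio; the quality of this SLI is bounded by the quality of the validator.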

EVALUATION-DRIVEN DEVELOPMENT

How to Define and Implement an SLI

A Service Level Indicator (SLI) is the foundational, quantitative measurement for evaluating an AI service's performance against its reliability targets.

In AI systems, SLIs extend beyond infrastructure to measure model-specific quality, including prediction accuracy, inference latency, and hallucination rates. Defining a precise SLI involves selecting a measurable event, a method of aggregation (e.g., a percentile or average), and a relevant time window for evaluation.

Implementation requires instrumenting the service to emit the raw data for the chosen metric, often via telemetry systems like Prometheus or OpenTelemetry. This data is then aggregated and compared against the SLO target to calculate an error budget. For AI canary deployments, SLIs are critical for Automated Canary Analysis (ACA), where metrics from the new version are statistically compared to a baseline to generate a deployment verdict. Effective SLIs are direct, representative of user experience, and aligned with business objectives.
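One way to capture the three parts of a definition (measurable event, aggregation method, time window) is a small declarative spec that is evaluated against raw events. The `SliSpec` class, the event dictionary keys, the `/predict` path, and the 250 ms target below are all hypothetical names and values chosen for this sketch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SliSpec:
    """Hypothetical SLI definition: event selector, aggregation, window."""
    name: str
    select: Callable[[dict], Optional[float]]      # None means "not eligible"
    aggregate: Callable[[Sequence[float]], float]
    window_s: int


def evaluate(spec, events, now_s):
    """Aggregate the selected values of events inside the time window."""
    values = []
    for event in events:
        if not (now_s - spec.window_s <= event["ts"] <= now_s):
            continue
        value = spec.select(event)
        if value is not None:
            values.append(value)
    return spec.aggregate(values) if values else None


def predict_latency(event):
    """Select latency for prediction requests only."""
    return event["latency_ms"] if event["path"] == "/predict" else None


spec = SliSpec("predict_mean_latency_ms", predict_latency, mean, window_s=300)
events = [
    {"ts": 10, "path": "/predict", "latency_ms": 100.0},   # outside window
    {"ts": 400, "path": "/predict", "latency_ms": 200.0},
    {"ts": 450, "path": "/health", "latency_ms": 5.0},     # not eligible
    {"ts": 500, "path": "/predict", "latency_ms": 300.0},
]
sli_value = evaluate(spec, events, now_s=600)
slo_met = sli_value is not None and sli_value <= 250.0
```

Keeping the definition as data makes it easier to apply the same SLI to both the baseline and the canary, which is what a fair comparison requires.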

SERVICE LEVEL HIERARCHY

SLI vs. SLO vs. SLA: A Comparison

A comparison of the three core components of service reliability management, detailing their purpose, format, and audience within the context of AI/ML service deployment.

Feature	Service Level Indicator (SLI)	Service Level Objective (SLO)	Service Level Agreement (SLA)
Core Definition	A quantitative measure of a specific aspect of service performance.	A target value or range for an SLI over a specific period.	A formal contract defining the consequences of failing to meet SLOs.
Primary Role	Measurement. The raw, observed metric.	Internal Goal. The target for the measured metric.	External Promise. The business commitment with penalties.
Format & Granularity	A precise metric (e.g., p99 latency = 225ms, error rate = 0.15%).	A target threshold (e.g., p99 latency < 250ms, error rate < 0.3%).	A legal document with financial/credit penalties (e.g., 99.9% uptime SLO, with service credits for breach).
Audience & Purpose	Engineering & SRE teams. Used for monitoring, debugging, and calculating SLO compliance.	Internal product & engineering teams. Defines the reliability target for development and operations.	External customers or business stakeholders. Defines the business risk and liability of service unreliability.
Example in AI/ML Context	Model inference latency measured at the 99th percentile. Token generation throughput. Hallucination rate detected by a validator model.	p99 model inference latency < 300ms for 95% of days in a quarter. Hallucination rate < 2%.	If the quarterly SLO for p99 latency is not met, the customer receives a 10% service credit. Defines the support response time for model downtime.
Relationship	The measured input. Feeds the SLO calculation.	The goal set for the SLI. Defines the error budget.	The business wrapper that incorporates SLOs and defines remedies.
Change Frequency	High. Metrics can be added or refined as the service evolves.	Medium. Reviewed and adjusted quarterly based on error budget consumption and business needs.	Low. Legally binding; changes require contract renegotiation.
Key Action Trigger	Alerting when a metric deviates from normal behavior.	Error budget burn rate alerts. Triggers a focus on reliability work.	Breach triggers contractual penalties (e.g., service credits, termination rights).

PRODUCTION CANARY ANALYSIS

Frequently Asked Questions

Service Level Indicators (SLIs) are the foundational metrics used to quantitatively evaluate the health and performance of AI services during controlled deployments like canary releases. These questions address their definition, implementation, and role in modern MLOps.

A Service Level Indicator (SLI) is a directly measurable metric that quantifies a specific aspect of a service's performance or reliability from the user's perspective. It is the raw measurement used to calculate compliance with a Service Level Objective (SLO). For AI services, common SLIs include:

Request Latency: The time from when a user sends a request to when they receive a complete response, often measured as a percentile (e.g., p95, p99).
Error Rate: The proportion of requests that result in a failure, such as a 5xx HTTP status code, a model inference error, or a failed validation check.
Throughput: The number of successful requests the service handles per second.
Model Quality Metrics: For AI/ML services, this can include metrics like prediction accuracy, hallucination rate for generative models, or business Key Performance Indicators (KPIs) derived from model outputs.

An SLI must be well-defined, consistently measurable, and representative of the user experience. It serves as the foundational data point for all reliability engineering.

PRODUCTION CANARY ANALYSIS

Related Terms

A Service Level Indicator (SLI) is a core component of a quantitative reliability framework. It works in concert with other key concepts to enable safe, data-driven deployments and operational excellence.

Service Level Objective (SLO)

A Service Level Objective (SLO) is a target value or range for a Service Level Indicator (SLI). It defines the acceptable level of service reliability over a specific time period.

Example: An SLO could be "99.9% of HTTP requests return a successful (2xx) response over a 30-day window."
Relationship to SLI: The SLI (e.g., error rate) is the measured metric; the SLO is the goal for that measurement.
Purpose: SLOs provide a clear, business-aligned target for engineering teams, forming the basis for error budgets and guiding prioritization decisions.


Error Budget

An error budget is the calculated amount of allowable unreliability for a service, derived from its Service Level Objective (SLO). It is defined as 1 - SLO.

Calculation: If your SLO is 99.9% availability, your error budget is 0.1% unreliability over the compliance period.
Function: The error budget quantifies how much risk a team can take. Introducing new features, performing deployments, or conducting experiments consumes this budget.
Canary Analysis Link: A failed canary that causes a significant increase in errors consumes the error budget, triggering an automated rollback to preserve reliability.
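The budget can also be expressed as a count of events rather than a percentage. The sketch below is a minimal example, assuming a request-based SLO and a made-up volume of one million requests in the compliance period; the function names are illustrative.

```python
def error_budget_events(slo, total_events):
    """Number of bad events the SLO tolerates: (1 - SLO) * total events."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1 - slo) * total_events


def budget_remaining_fraction(slo, total_events, bad_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_events(slo, total_events)
    return (budget - bad_events) / budget


# Made-up period: 1,000,000 requests under a 99.9% SLO, 250 failures so far.
budget = error_budget_events(0.999, 1_000_000)
remaining = budget_remaining_fraction(0.999, 1_000_000, 250)
```

Here the budget is about 1,000 failed requests and roughly three quarters of it remains; a canary that suddenly added several hundred failures would visibly eat into that margin.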

Service Level Agreement (SLA)

A Service Level Agreement (SLA) is a formal contract between a service provider and a customer that includes Service Level Objectives (SLOs) and specifies the consequences (e.g., financial penalties) for failing to meet them.

Key Difference from SLO: An SLO is an internal engineering goal. An SLA is an external, contractual commitment.
Hierarchy: SLIs are measured to determine if SLOs are met. SLOs are set more stringently than SLAs to provide a safety buffer.
Example: An internal SLO might be 99.95% uptime to comfortably guarantee a customer-facing SLA of 99.9%.
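The size of that safety buffer is easy to quantify. The sketch below converts both targets into allowed downtime, assuming a 30-day period, which is a choice of this example rather than something the targets themselves specify.

```python
def allowed_downtime_minutes(target, period_days=30):
    """Minutes of downtime an availability target permits over the period."""
    return (1 - target) * period_days * 24 * 60


sla_minutes = allowed_downtime_minutes(0.999)    # customer-facing SLA
slo_minutes = allowed_downtime_minutes(0.9995)   # stricter internal SLO
buffer_minutes = sla_minutes - slo_minutes
```

Over 30 days the SLA tolerates about 43.2 minutes of downtime and the SLO about 21.6, so the team has roughly 21.6 minutes of headroom between missing its internal goal and breaching the contract.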

Golden Signals

Golden Signals are four high-level metrics that provide a comprehensive view of a service's health from a user's perspective. They are the primary candidates for defining Service Level Indicators (SLIs).

Latency: The time it takes to service a request. SLI example: 95th percentile request duration.
Traffic: The demand placed on the system. SLI example: HTTP requests per second.
Errors: The rate of failed requests. SLI example: percentage of HTTP 5xx responses.
Saturation: How "full" the service is. SLI example: memory utilization percentage or queue depth.

These signals are foundational for canary analysis, where they are compared between the baseline and new deployment.

Automated Canary Analysis (ACA)

Automated Canary Analysis (ACA) is the process of programmatically comparing the Service Level Indicators (SLIs) of a canary deployment against a baseline (control) deployment to generate a pass/fail deployment verdict.

Process: Tools like Kayenta, Flagger, or Argo Rollouts collect identical SLI metrics (e.g., error rate, latency p99) from both the old and new service versions.
Statistical Testing: ACA uses statistical tests (e.g., Mann-Whitney U test) to determine if observed differences in SLIs are significant and indicate regression.
Outcome: Based on pre-defined thresholds, the ACA system automatically decides to promote the canary to full production or initiate a rollback.
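The statistical step can be illustrated with a small, dependency-free version of the Mann-Whitney U test. This is a sketch, not how Kayenta, Flagger, or Argo Rollouts implement or configure their analysis: it uses the normal approximation without a tie correction to the variance, and the `verdict` rule, the 0.05 threshold, and the sample latencies are assumptions of the example.

```python
import math
from statistics import NormalDist, median


def mann_whitney_u(baseline, canary):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U statistic for the baseline sample, p-value).
    Tied values receive the average of the ranks they span.
    """
    n1, n2 = len(baseline), len(canary)
    combined = sorted([(v, 0) for v in baseline] + [(v, 1) for v in canary])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p_value


def verdict(baseline, canary, alpha=0.05):
    """Fail when canary latency is significantly higher than baseline."""
    _, p_value = mann_whitney_u(baseline, canary)
    worse = median(canary) > median(baseline)
    return "fail" if (p_value < alpha and worse) else "pass"


# Made-up latency samples (ms): the canary is clearly slower.
baseline_ms = [100 + 2 * i for i in range(10)]
canary_ms = [150 + 2 * i for i in range(10)]
u_stat, p = mann_whitney_u(baseline_ms, canary_ms)
decision = verdict(baseline_ms, canary_ms)
```

A rank-based test is a reasonable fit for latency because it makes no assumption that the values are normally distributed, though small samples make the normal approximation used here less reliable.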


Synthetic Monitoring

Synthetic monitoring involves using scripted, simulated transactions to proactively test and measure the performance and availability of a service from external points. It is a key source of data for Service Level Indicators (SLIs).

Purpose: To measure service health and SLI compliance from a user's geographic perspective before real users are affected.
Contrast with RUM: Unlike Real User Monitoring (RUM), which measures actual user traffic, synthetic monitoring uses controlled, predictable scripts.
Canary Use Case: Synthetic probes can be directed at canary instances to validate functionality and performance (latency, success rate) as part of the deployment validation process, providing early warning signals.
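A synthetic probe boils down to a scripted call that is timed and checked for success. The sketch below keeps the transport abstract: `probe` is any callable supplied by the caller, and the fake probe used here stands in for what would normally be a real request against a canary endpoint. The function name and the returned keys are assumptions of this example.

```python
import time


def run_probe(probe, attempts):
    """Call probe() repeatedly, recording success rate and worst latency.

    probe is any zero-argument callable that returns a truthy value on
    success; an exception is counted as a failed attempt.
    """
    latencies_ms = []
    successes = 0
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        latencies_ms.append((time.perf_counter() - start) * 1000)
        if ok:
            successes += 1
    return {
        "success_rate": successes / attempts,
        "max_latency_ms": max(latencies_ms),
    }


# Fake probe for illustration: three successes and one failure.
outcomes = iter([True, True, False, True])
probe_result = run_probe(lambda: next(outcomes), attempts=4)


def failing_probe():
    """Simulates a probe whose request raises an error."""
    raise ConnectionError("simulated outage")


outage_result = run_probe(failing_probe, attempts=3)
```

Because the probe is injected, the same loop can be pointed at the stable and canary instances in turn, yielding directly comparable success-rate and latency figures.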

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
