Glossary

Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a directly measurable metric that quantifies a specific aspect of a service's performance, such as latency, error rate, or throughput, and serves as the basis for evaluating a Service Level Objective (SLO).

EVALUATION-DRIVEN DEVELOPMENT

What is a Service Level Indicator (SLI)?

A Service Level Indicator (SLI) is a quantitative, user-centric measure of a service's performance or reliability. It is the foundational data point for Service Level Objectives (SLOs), which are the targets teams commit to. For AI services, common SLIs include model inference latency, error rate (e.g., HTTP 5xx responses), and quality metrics like hallucination rate or retrieval precision. An SLI must be precisely defined, consistently measurable, and directly tied to user experience.

Selecting the right SLI is critical when defining SLOs for AI services. It should reflect a Critical User Journey (CUJ), such as the time to generate a complete answer for a chatbot. SLIs are monitored over a rolling time window, and their values are compared against the SLO to calculate error budget consumption. In complex systems, a composite SLO may aggregate multiple underlying SLIs to represent overall service health.

SLO/SLI DEFINITION FOR AI

Core Characteristics of an Effective SLI

A well-defined Service Level Indicator (SLI) is the cornerstone of reliable AI services. These characteristics ensure an SLI is measurable, actionable, and directly tied to user experience.

Directly Measurable

An effective SLI must be a quantifiable metric derived from observable system data, not an opinion or estimate. It is calculated from raw telemetry like logs, metrics, or tracing data.

Examples: Model inference latency (p95), successful request rate, token generation throughput.
Non-Examples: "User happiness," "system health," or "model seems accurate."

The measurement must be automatable and reproducible, forming an objective basis for evaluating an SLO.

User-Centric & Relevant

The SLI should measure an aspect of the service that directly impacts the end-user's experience or business outcome. It focuses on the external behavior of the service, not internal operational details.

User-Facing: Measures what the user perceives (e.g., end-to-end API latency for a chat completion).
Journey-Based: Often aligns with a Critical User Journey (CUJ), such as "time to receive a complete answer from the RAG system."

Internal metrics like GPU utilization are important for diagnostics but are poor SLIs, as users are unaffected by high utilization if latency remains low.

Well-Defined Aggregation & Window

The precise method of calculation and the time period of evaluation must be unambiguous. This defines what is being measured and over what duration.

Aggregation: Specify the statistical method (e.g., average, 95th percentile, ratio of counts). For latency, percentiles (p95, p99) are crucial to track tail performance.
Time Window: Define the rolling window for calculation (e.g., "the ratio of successful requests over the past 30 days"). This aligns with the SLO's compliance period and enables burn rate analysis.

Example: p95(model_inference_latency) over a 1-minute rolling window.
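
As a rough illustration of the example above, the following Python sketch computes a p95 latency over a rolling window. The class name `RollingLatencySLI`, the nearest-rank percentile method, and the in-memory sample store are assumptions made for this sketch; a production system would more likely compute this in its metrics backend.

```python
import math
from collections import deque


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


class RollingLatencySLI:
    """Hypothetical helper: keeps latency samples from the last `window_s` seconds."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, latency_ms), oldest first

    def record(self, timestamp_s, latency_ms):
        self.samples.append((timestamp_s, latency_ms))

    def p95(self, now_s):
        # Drop samples that have fallen out of the rolling window.
        while self.samples and self.samples[0][0] <= now_s - self.window_s:
            self.samples.popleft()
        return percentile_nearest_rank([lat for _, lat in self.samples], 95)
```

Samples are assumed to arrive in timestamp order. Other percentile definitions (for example, linear interpolation) give slightly different values, which is one reason the aggregation method should be stated explicitly in the SLI definition.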

AI-Service Specific

For AI-powered services, SLIs must capture the unique quality and performance dimensions of machine learning inference and data pipelines, beyond traditional HTTP metrics.

Quality SLIs: Hallucination rate, answer faithfulness score, retrieval precision@K.
Performance SLIs: Time To First Token (TTFT), Time Per Output Token (TPOT), end-to-end RAG pipeline latency.
Data SLIs: Input data drift magnitude, training-serving skew detection.

These specialized indicators are essential for defining meaningful SLOs for AI systems.
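
To make the performance SLIs above concrete, here is a minimal Python sketch that derives TTFT and TPOT for a single streamed response. It assumes the client records when the request was sent and when each output token arrived; the function name `ttft_and_tpot` and the convention of averaging gaps after the first token are illustrative choices, not a standard API.

```python
def ttft_and_tpot(request_sent_s, token_times_s):
    """Return (TTFT, TPOT) in seconds for one streamed response.

    TTFT: delay from sending the request to receiving the first token.
    TPOT: mean gap between consecutive output tokens after the first.
    """
    if not token_times_s:
        raise ValueError("response produced no tokens")
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) == 1:
        return ttft, None  # TPOT is undefined for a single-token response
    tpot = (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)
    return ttft, tpot
```

Per-request values like these would then be aggregated (for example, as a p95 over a rolling window) to form the SLI itself.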

Controllable & Actionable

The engineering team must have clear levers to influence the SLI. If the metric degrades, there should be known procedures or system changes that can improve it.

Actionable Example: High p99 latency can be addressed by optimizing model inference (e.g., enabling continuous batching), scaling resources, or simplifying a feature.
Non-Actionable Example: An SLI based on global internet latency is not controllable by the service team.

This characteristic ensures SLIs drive meaningful engineering and operational improvements.

Aligned with SLOs & Business Goals

An SLI is not defined in isolation; it is the measurable component of a Service Level Objective (SLO). The SLO sets the target for the SLI (e.g., "99.9% of requests complete in under 500 ms, measured over the quarter").

Error Budget Derivation: Error budget consumption is calculated directly from the SLI's measured values.
Business Correlation: The best SLIs have a demonstrable link to business metrics like user retention, conversion rate, or revenue. For instance, high latency (an SLI) may correlate with cart abandonment (a business KPI).

This alignment turns technical metrics into drivers of business reliability.
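
The link between an SLI and error budget consumption can be sketched in a few lines of Python. This example assumes a request-based latency SLO with a 500 ms threshold and a 99.9% target; the function name and default values are illustrative.

```python
def error_budget_consumed(latencies_ms, threshold_ms=500.0, slo_target=0.999):
    """Return (sli, fraction_of_budget_consumed) for a latency SLO.

    A request counts as "good" when its latency is under threshold_ms.
    """
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no requests in the compliance period")
    bad = sum(1 for lat in latencies_ms if lat >= threshold_ms)
    sli = (total - bad) / total
    allowed_bad = (1 - slo_target) * total  # error budget expressed in requests
    return sli, bad / allowed_bad
```

A consumed fraction above 1.0 means the SLO has been missed for the period. With 10,000 requests and a 99.9% target, roughly 10 slow requests are allowed, so 5 slow requests consume about half the budget.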

METRIC CATEGORIES

Common SLIs for AI-Powered Services

A comparison of directly measurable Service Level Indicators (SLIs) across the primary quality dimensions of a production AI service.

SLI Name & Description	User Experience (UX) & Quality	System Performance & Efficiency	Data & Model Health
Model Inference Latency	✅ Core UX metric; time from request to final response.	✅ Direct measure of compute efficiency and system load.	❌ Not directly indicative of model quality.
Time To First Token (TTFT)	✅ Critical for perceived responsiveness in streaming.	✅ Measures initial processing and prompt encoding overhead.
Time Per Output Token (TPOT)	✅ Determines speed of streaming text/audio generation.	✅ Key throughput metric for autoregressive models.
Request Success Rate (Non-5xx)	✅ Primary indicator of service availability to users.	✅ Tracks infrastructure and dependency failures.	✅ Can signal upstream data source or model loading issues.
Hallucination Rate / Answer Faithfulness	✅ Direct measure of output factual correctness.	❌ Not a system performance metric.	✅ Core indicator of RAG/grounding effectiveness and model drift.
Retrieval Precision@K (for RAG)	✅ Impacts answer quality and user trust.		✅ Fundamental measure of retrieval subsystem health.
Token Throughput (Tokens/sec)	❌ Internal efficiency metric, not directly user-facing.	✅ Key for capacity planning and cost-per-query calculations.
GPU/TPU Utilization		✅ Core infrastructure efficiency and saturation signal.	✅ High variance can indicate batching issues or load imbalance.
Input/Output Token Count (P50, P95)	✅ Can correlate with cost and latency for users on tiered plans.	✅ Essential for predicting load and optimizing batching.	✅ Sudden shifts may indicate prompt injection or output degeneration.
Data Drift / Feature Distribution Shift		❌ Not a performance metric.	✅ Leading indicator of future model accuracy degradation.
Prediction Confidence Score Distribution	✅ Low confidence can be used to trigger human review.		✅ Key for monitoring model calibration and over/under-confidence.
Agent Task Success Rate	✅ End-to-end measure of autonomous agent effectiveness.	✅ Indirectly measures system reliability across tool calls.	✅ Holistic health check for multi-step reasoning and planning.

GLOSSARY

How to Define and Implement an SLI

An SLI is a quantitative, user-centric measure of a service's health. For AI services, common SLIs include model inference latency, error rate (e.g., HTTP 5xx responses or failed output validations), and throughput in queries per second. The definition process starts by identifying a Critical User Journey (CUJ) and selecting the metric that most directly reflects its quality from the user's perspective, ensuring the SLI is actionable and aligned with business value.

Implementation requires instrumenting the service to emit the raw data, then aggregating it into a reliable metric over a defined time window. This involves calculating percentiles (e.g., p95 latency) or ratios (e.g., successful requests / total requests). The SLI is then continuously monitored and compared against its SLO target. For AI systems, specialized SLIs like Time To First Token (TTFT), hallucination rate, or Retrieval Precision@K for RAG systems are essential for defining and upholding quality standards.
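
As one example of turning raw data into a quality SLI, the Python sketch below computes Retrieval Precision@K and averages it across queries. It assumes relevance labels are available for each query (for instance, from a labeled evaluation set); the function names and input shapes are hypothetical.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant.

    Divides by k even if fewer than k documents were retrieved,
    which is a common convention but not the only one.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(queries, k):
    """Average Precision@K over (retrieved_ids, relevant_ids) pairs."""
    if not queries:
        raise ValueError("no queries to evaluate")
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in queries]
    return sum(scores) / len(scores)
```

The averaged value over a defined window of sampled queries is what would be compared against the SLO target.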

SERVICE LEVEL INDICATORS FOR AI

Frequently Asked Questions

Service Level Indicators (SLIs) are the foundational, measurable metrics that quantify the performance and reliability of AI-powered services. This FAQ addresses common questions about defining, implementing, and using SLIs to uphold rigorous Service Level Objectives (SLOs) in production AI systems.

A Service Level Indicator (SLI) is a directly measurable metric that quantifies a specific aspect of an AI-powered service's performance, reliability, or quality, such as model inference latency, error rate, or throughput. For AI services, SLIs go beyond generic infrastructure metrics to capture behaviors specific to machine learning systems, and they serve as the empirical basis for evaluating whether a Service Level Objective (SLO) is being met. Examples include Time To First Token (TTFT) for LLM responsiveness, retrieval precision for RAG systems, and hallucination rate for generative model accuracy.

SLO/SLI DEFINITION FOR AI

Related Terms

Service Level Indicators (SLIs) are the foundation of quantitative reliability engineering. These related concepts define how SLIs are used to set targets, manage risk, and ensure AI-powered services meet user expectations.

Service Level Objective (SLO)

A Service Level Objective (SLO) is the quantitative target set for a Service Level Indicator (SLI). It defines the acceptable level of service, typically expressed as a percentage over a rolling time window (e.g., "99.9% of requests must have latency < 100ms over 30 days"). For AI services, SLOs apply to metrics like inference latency, error rate, or output quality. An SLO is an internal goal that balances reliability with innovation velocity, creating a formal error budget for the engineering team.

Error Budget

An error budget is the explicit, allowable amount of unreliability a service can incur without violating its Service Level Objective (SLO). It is calculated as 100% - SLO. For example, a 99.9% SLO creates a 0.1% error budget. This budget quantifies risk, enabling teams to make data-driven decisions about deploying new features, taking on technical debt, or performing risky migrations. Once the budget is exhausted, the focus must shift to improving reliability before further feature development.
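
The arithmetic can be made tangible by converting the budget into time. The Python sketch below assumes a time-based availability SLO over a 30-day window; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Error budget (100% - SLO) expressed as minutes of downtime in the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60
```

Under these assumptions, a 99.9% SLO leaves about 43.2 minutes of allowable downtime per 30 days, while a 99.99% SLO leaves about 4.3 minutes.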

Service Level Agreement (SLA)

A Service Level Agreement (SLA) is a formal contract with external customers that includes one or more Service Level Objectives (SLOs) and defines the business consequences—such as service credits or financial penalties—if those SLOs are not met. While an SLO is an internal engineering target, an SLA is an external promise. SLAs for AI services must be carefully scoped, as factors like unpredictable model behavior or upstream data provider issues can increase the risk of breach.

Golden Signal

A golden signal is one of four high-level metrics used in Site Reliability Engineering (SRE) to comprehensively assess a service's health. They are:

Latency: Time to serve a request.
Traffic: Demand on the system (e.g., queries per second).
Errors: Rate of failed requests.
Saturation: How "full" a resource is (e.g., GPU memory utilization).

For AI services, these signals are foundational SLIs. Monitoring them provides a quick, holistic view of system performance and is the first step toward defining precise SLOs.

Critical User Journey (CUJ)

A Critical User Journey (CUJ) is a specific, high-value sequence of user interactions that is essential to the core value of a service (e.g., "user uploads a document, asks a question, and receives a summarized answer"). SLIs and SLOs should be derived from CUJs to ensure reliability targets align with user experience, not just backend system metrics. For AI services, this means measuring end-to-end latency and success rate for the complete journey, which may involve multiple model calls and retrieval steps.
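
A journey-level SLI can be sketched by combining the outcomes of the individual steps. The Python example below assumes each journey is recorded as a list of sequential steps with a latency and a success flag; the `Step` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str          # e.g. "retrieve", "generate"
    latency_ms: float
    ok: bool


def journey_outcome(steps):
    """Return (succeeded, end_to_end_latency_ms) for one journey.

    Assumes steps run sequentially, so their latencies add up, and that
    the journey succeeds only if every step succeeds.
    """
    succeeded = all(step.ok for step in steps)
    total_latency_ms = sum(step.latency_ms for step in steps)
    return succeeded, total_latency_ms


def journey_success_rate(journeys):
    """Fraction of journeys in which every step succeeded."""
    if not journeys:
        raise ValueError("no journeys recorded")
    successes = sum(1 for steps in journeys if journey_outcome(steps)[0])
    return successes / len(journeys)
```

Measuring at the journey level means a failed retrieval step counts against the SLI even when the model call itself returned successfully.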

Burn Rate

Burn rate is the speed at which a service consumes its error budget. It is calculated as the percentage of the budget used per unit of time (e.g., 20% of the monthly budget burned per hour). Burn rate is the basis for multi-window alerting: a high burn rate over a short window (e.g., 5 minutes) signals a potential incident, while a sustained moderate burn rate over a longer window (e.g., 6 hours) indicates a chronic issue that will violate the SLO if unaddressed.
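
The following Python sketch shows one way burn rate and a two-window check might be computed. It expresses burn rate as a multiple of the error rate the SLO allows, a common alternative formulation that converts directly to "share of budget per hour". The threshold values, the alert labels, and the function names are illustrative assumptions, not prescribed settings.

```python
def burn_rate(bad, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window.
    Assumes slo_target < 1.
    """
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo_target)


def budget_fraction_per_hour(rate, window_days=30):
    """Share of the whole window's error budget consumed per hour at this burn rate."""
    return rate / (window_days * 24)


def classify(short_window_rate, long_window_rate, fast_threshold=14.4, slow_threshold=6.0):
    """Illustrative two-window check: fast burn pages, sustained burn opens a ticket."""
    if short_window_rate >= fast_threshold:
        return "page"
    if long_window_rate >= slow_threshold:
        return "ticket"
    return "ok"
```

For example, 20 failures in 1,000 requests against a 99.9% SLO gives a burn rate of about 20, which consumes roughly 2.8% of a 30-day budget per hour.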

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
