Glossary

Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a quantitatively measured aspect of an LLM service's performance, such as request latency, throughput, or error rate, that is used to assess compliance with a Service Level Objective (SLO).

LLM PERFORMANCE MONITORING

What is a Service Level Indicator (SLI)?

A Service Level Indicator (SLI) is a specific, measurable metric that quantifies a critical aspect of a service's performance or reliability from the user's perspective. In the context of Large Language Model (LLM) operations, common SLIs include latency percentiles (P95, P99), Time to First Token (TTFT), tokens-per-second throughput, and request success or error rates. These indicators provide the raw, objective data required to define and track service quality.

SLIs are the foundational inputs for Service Level Objectives (SLOs), which are the agreed-upon target values for each indicator. By continuously monitoring SLIs, engineering teams can calculate their error budget—the allowable deviation from SLOs—and make data-driven decisions about deployments, capacity planning, and incident response to maintain reliable LLM-powered applications.

Key Characteristics of an Effective SLI

A well-defined Service Level Indicator (SLI) is the foundation of reliable LLM operations. These characteristics ensure SLIs are measurable, actionable, and directly tied to user experience.

Quantitatively Measurable

An effective SLI must be expressed as a numerical quantity derived from observable system data. It cannot be a subjective judgment. For LLMs, this typically involves:

Latency metrics: Time to First Token (TTFT), Inter-Token Latency, end-to-end request duration.
Throughput metrics: Tokens per Second (TPS), successful requests per minute.
Quality metrics: Perplexity scores, hallucination rates (though these require careful measurement).
Availability metrics: Uptime percentage, error rate (4xx/5xx responses).

The measurement must be automatable via telemetry systems like OpenTelemetry or Prometheus.
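
As a minimal sketch of how the latency and throughput numbers above can be derived from observable data, the following example computes them for a single streamed response. The `StreamTiming` record and the `latency_slis` function are hypothetical names chosen for illustration; in practice the timestamps would come from whatever tracing or telemetry system is in use.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StreamTiming:
    """Timestamps (in seconds) recorded for one streamed completion."""
    request_sent: float
    token_times: list[float]  # arrival time of each generated token


def latency_slis(t: StreamTiming) -> dict[str, float]:
    """Derive per-request latency and throughput values from raw timestamps."""
    if not t.token_times:
        raise ValueError("no tokens were received")
    ttft = t.token_times[0] - t.request_sent
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    total = t.token_times[-1] - t.request_sent
    return {
        "ttft_s": ttft,
        "mean_inter_token_s": mean(gaps) if gaps else 0.0,
        "end_to_end_s": total,
        "tokens_per_second": len(t.token_times) / total if total > 0 else 0.0,
    }


timing = StreamTiming(request_sent=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(latency_slis(timing))
```

These are per-request values; an SLI is normally reported as an aggregate of them (a percentile or a ratio) over many requests.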

Directly User-Centric

The SLI should measure an aspect of the service that directly impacts the end-user's experience or the business outcome. Avoid proxy metrics that are easy to measure but not felt by users.

Good Examples:

P99 latency for chat completions (users feel slow responses).
Error rate for API requests (users get failed interactions).
Token throughput for a summarization feature (users wait for the result).

Poor Examples:

GPU utilization percentage (infrastructure concern, not user-facing).
Cache hit rate (an internal optimization, not a user outcome).

Controllable by the Engineering Team

The performance of the SLI should be primarily influenced by engineering decisions and system changes within the team's purview. If an SLI is affected by external factors the team cannot mitigate, it fails as a useful indicator.

For LLM services, this means:

Model serving infrastructure choices (batching, hardware) affect latency (TTFT, TPS).
Application code and prompt engineering affect error rates and output validity.
System architecture (caching, load balancing) affects availability.

An SLI like "third-party API latency" is not controllable if the dependency is external.

Aligned with a Service Level Objective (SLO)

An SLI is meaningless without a target. It must have a corresponding Service Level Objective (SLO)—a target value or range that defines acceptable performance. The SLO provides the context for whether the SLI's current value is good or bad.

Example Pairing:

SLI: End-to-end latency for a text generation endpoint.
SLO: 95% of requests complete within 2 seconds over a 28-day window.

The SLO, set as a target on the SLI, defines an error budget that guides deployment velocity and the prioritization of reliability work.
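
The example pairing can be sketched as a small check. The function name `slo_compliance` and the sample latencies are made up for illustration; the 2-second threshold and 95% target come from the pairing above.

```python
def slo_compliance(latencies_s, threshold_s=2.0, target=0.95):
    """Return (sli, met): the fraction of requests at or under the
    threshold, and whether that fraction meets the SLO target."""
    if not latencies_s:
        raise ValueError("no requests in the window")
    good = sum(1 for x in latencies_s if x <= threshold_s)
    sli = good / len(latencies_s)
    return sli, sli >= target


# Hypothetical window: 96 fast requests and 4 slow ones.
window = [0.8] * 96 + [3.5] * 4
sli, met = slo_compliance(window)
print(f"SLI = {sli:.2%}, SLO met: {met}")
```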

Consistently Measured Over a Defined Window

SLI measurement must be consistent and comparable over time. This requires:

A stable collection methodology (e.g., always measured at the API gateway).
A defined aggregation window (e.g., rolling 28 days, daily) for assessment against SLOs.
Clear aggregation rules (e.g., is the SLI a percentile, a mean, a ratio?).

For LLM latency, this often means tracking latency percentiles (P50, P90, P99) over a rolling window to understand both typical and tail performance. Inconsistency in measurement invalidates trend analysis and SLO compliance tracking.
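
One possible way to compute percentiles over a rolling window is sketched below. It assumes samples are available as `(timestamp, latency)` pairs and uses the nearest-rank percentile definition; production systems often estimate percentiles from histogram buckets instead, which gives slightly different values. The function names are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def rolling_percentiles(samples, now, window_s, ps=(50, 90, 99)):
    """samples: iterable of (timestamp_s, latency_s) pairs.
    Only samples with now - window_s < timestamp <= now are used."""
    recent = [lat for ts, lat in samples if now - window_s < ts <= now]
    return {f"p{p}": percentile(recent, p) for p in ps}


# Synthetic data: one sample per second, latency equal to its timestamp.
samples = [(t, float(t)) for t in range(1, 201)]
print(rolling_percentiles(samples, now=200, window_s=100))
```

Fixing the percentile definition and the window boundaries in code like this is one way to keep the aggregation rule stable from one reporting period to the next.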

Simple and Few in Number

A service should have a small, focused set of SLIs (typically 2-5) that capture its core reliability promises. Too many SLIs create noise and dilute focus. The goal is to identify the vital few metrics that truly indicate service health.

For a core LLM inference API, essential SLIs might be:

Availability: Successful request ratio.
Latency: P99 request duration.
Throughput: Sustained Tokens per Second (for cost/performance).

Additional quality SLIs (e.g., for hallucination rate) may be added for specific, high-stakes use cases but should not overwhelm the core set.

SERVICE LEVEL INDICATORS

Common SLI Examples for LLM Services

Quantitatively measured aspects of an LLM service's performance, used to assess compliance with Service Level Objectives (SLOs).

SLI Category	Latency & Responsiveness	Throughput & Scalability	Quality & Correctness	Reliability & Availability
Primary Metric	Time to First Token (TTFT) P99 < 2 sec	Sustained Tokens per Second (TPS) > 100	Hallucination Rate < 3%	Request Success Rate > 99.9%
Supporting Metric	Inter-Token Latency P95 < 100 ms	Peak Concurrent Requests > 1000	Task-Specific Accuracy Score > 0.95	Model/Endpoint Uptime > 99.95%
Measurement Method	Distributed tracing from client request to first token streamed.	Aggregate token count over time from model serving layer.	Comparison of generated output against a golden dataset or human evaluation.	Count of successful HTTP 200 responses vs. 4xx/5xx errors and timeouts.
Typical SLO Target	P99 TTFT < 1.5 sec for 28-day rolling window.	Sustained TPS > 150 for 95% of 5-minute intervals.	Hallucination rate remains within 2 percentage points of baseline.	Error budget consumption < 10% per month.
Key Influencing Factors	Input prompt length, model size, GPU compute, network latency, prefill stage.	Batch size, KV cache efficiency, continuous batching, hardware acceleration.	Prompt clarity, model temperature, context relevance, retrieval-augmented generation (RAG) grounding.	Infrastructure health, dependency failures (e.g., vector database), quota limits, model loading time.
Monitoring Tools	OpenTelemetry traces, Prometheus histograms, Grafana dashboards.	Custom exporters to Prometheus, vendor-specific metrics APIs.	Automated evaluation pipelines, human-in-the-loop (HITL) review platforms.	Synthetic probes, health check endpoints, load balancer metrics, application logs.
Associated Risk	Poor user experience for streaming applications.	Inability to handle traffic spikes, queue buildup.	Loss of user trust, generation of incorrect or harmful content.	Service outages, violation of contractual agreements.
Mitigation Strategy	Inference optimization (quantization), caching frequent prompts, scaling compute.	Implement continuous batching, scale horizontally, optimize KV cache usage.	Implement output validation, use RAG, fine-tune on domain data, adjust sampling parameters.	Implement graceful degradation, canary deployments, redundant endpoints, automated failover.

How SLIs Relate to SLOs and Error Budgets

In LLM operations, Service Level Indicators (SLIs) are the foundational metrics that quantify performance, which are then formalized into targets via Service Level Objectives (SLOs) to create actionable error budgets for managing reliability.

A Service Level Indicator (SLI) is a directly measured quantitative metric of a service's performance or reliability, such as request latency, throughput (Tokens per Second), or error rate. For LLMs, common SLIs include Time to First Token (TTFT) and inter-token latency, which define the user-perceived responsiveness of a generative AI service. These raw measurements provide the empirical data needed to assess system health.

A Service Level Objective (SLO) is a target value or range for an SLI, defining the acceptable level of service, such as "P99 latency < 2 seconds." The SLO implies an error budget: the amount of unreliability the target permits over a period like a month (for example, a 99.9% success-rate target permits 0.1% of requests to fail). Comparing the measured SLI against the SLO shows how much of that budget has been spent. The budget guides engineering decisions on the risk and pace of new deployments, model updates, or infrastructure changes.

SERVICE LEVEL INDICATOR

Frequently Asked Questions

A Service Level Indicator (SLI) is a directly measurable metric that quantifies a specific aspect of a service's performance or reliability. In the context of Large Language Model (LLM) operations, an SLI is measured continuously from live production traffic and compared against a predefined target, the Service Level Objective (SLO), to determine whether the service is meeting its reliability goals.

How it works:

Definition: Engineers select a critical user-facing aspect of the service, such as the latency of chat completions or the success rate of tool-calling requests.
Measurement: Instrumentation (e.g., using OpenTelemetry) is added to the application code to record this metric for every request.
Aggregation: Raw measurements are aggregated over a defined time window (e.g., a 28-day rolling window) into a single percentage or value (e.g., 99.9% of requests had latency under 2 seconds).
Comparison: This aggregated value is compared to the SLO target. The gap between the failures the SLO allows and the failures actually observed is the remaining error budget, which guides operational decisions.
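
The aggregation and comparison steps can be illustrated with a short sketch for a request-based SLO. The function name, the returned field names, and the traffic numbers are assumptions made for the example.

```python
def error_budget_status(total_requests, bad_requests, slo_target):
    """Summarize error-budget consumption for a request-based SLO.

    slo_target is the required fraction of good requests, e.g. 0.999.
    """
    if total_requests <= 0:
        raise ValueError("no requests in the window")
    allowed_bad = (1 - slo_target) * total_requests
    consumed = bad_requests / allowed_bad if allowed_bad > 0 else float("inf")
    return {
        "sli": (total_requests - bad_requests) / total_requests,
        "allowed_bad_requests": allowed_bad,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }


# Hypothetical 28-day window: one million requests, 400 of them bad.
status = error_budget_status(1_000_000, 400, slo_target=0.999)
print(f"SLI: {status['sli']:.4%}")
print(f"Budget consumed: {status['budget_consumed']:.1%}")
print(f"Budget remaining: {status['budget_remaining']:.1%}")
```

A negative `budget_remaining` means more requests failed than the SLO allows for the window.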

Related Terms

A Service Level Indicator (SLI) is a core component of a measurable reliability framework. It is defined in relation to other key operational concepts.

Service Level Objective (SLO)

A Service Level Objective is a target value or range for an SLI that defines the acceptable performance and reliability of a service. It is the formal goal against which service health is measured.

Example: "99.9% of LLM chat completions must have a Time to First Token (TTFT) under 2 seconds."
SLOs are business agreements that balance user expectations with engineering feasibility.
Each request that misses the target consumes the Error Budget; once the budget is exhausted, the SLO has been violated.

Error Budget

An Error Budget quantifies the allowable unreliability for a service over a specific period, derived directly from its SLO. It is calculated as 1 - SLO.

Example: A 99.9% monthly availability SLO permits an error budget of 0.1% downtime, or approximately 43.2 minutes in a 30-day month.
This budget governs the pace of innovation and risk-taking: planned releases and unplanned incidents draw from the same budget, so time lost to incidents leaves less room for risky changes.
Exhausting the budget typically triggers a blameless postmortem and a freeze on new feature deployments.
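
The 1 - SLO calculation can be checked with a few lines of arithmetic. This sketch assumes a 30-day window; the function name is illustrative.

```python
def downtime_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = downtime_budget_minutes(target)
    print(f"{target:.2%} availability -> {minutes:.1f} minutes of downtime")
```

For the 99.9% target this gives 43.2 minutes, matching the example above. Each additional "nine" shrinks the budget by a factor of ten.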

Service Level Agreement (SLA)

A Service Level Agreement is a formal contract between a service provider and its customers that includes one or more SLOs, along with consequences (e.g., financial penalties) for failing to meet them.

Key Distinction: An SLA is an external, commercial contract. An SLO is an internal, engineering target. An SLI is the measured metric.
SLAs are typically less aggressive than internal SLOs to provide a safety buffer.
For internal LLM platforms, the "customer" may be another engineering team, governed by an Internal SLA.

Golden Dataset

A Golden Dataset is a curated, high-quality, and versioned set of input-output pairs used as a reference standard for evaluating LLM performance and detecting regressions.

It serves as the ground truth for monitoring Output Drift and Concept Drift.
Used in canary and shadow deployments to compare new model versions against a baseline.
Should be representative of critical production traffic and include edge cases. Maintaining its relevance over time is a key challenge.
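
A minimal sketch of scoring a model against a golden dataset follows. It assumes normalized exact match as the comparison, which is a deliberate simplification; real evaluations often use task-specific scoring or human review. The `generate` callables below are stand-in lookups, not real model clients, and the dataset is made up.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as mismatches."""
    return " ".join(text.lower().split())


def golden_accuracy(golden, generate):
    """golden: list of (prompt, expected_output) pairs.
    generate: callable mapping a prompt to the model's output."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(normalize(generate(p)) == normalize(e) for p, e in golden)
    return hits / len(golden)


golden = [
    ("2+2?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]
# Stand-ins for a baseline model and a candidate model.
baseline = {"2+2?": "4", "Capital of France?": "paris", "Opposite of hot?": "Cold"}.get
candidate = {"2+2?": "4", "Capital of France?": "Lyon", "Opposite of hot?": "cold"}.get

print("baseline:", golden_accuracy(golden, baseline))
print("candidate:", golden_accuracy(golden, candidate))
```

A drop in the candidate's score relative to the baseline on the same versioned dataset is the kind of regression signal a canary or shadow deployment would look for.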

Statistical Process Control (SPC)

Statistical Process Control is a method of quality control that uses statistical tools, primarily control charts, to monitor a process and detect abnormal variation.

Applied to SLIs to distinguish between common-cause variation (normal noise) and special-cause variation (indicative of a problem).
Control limits (e.g., 3-sigma) are calculated from historical SLI data. Data points outside these limits trigger alerts for Anomaly Detection.
This provides a more robust alerting mechanism than simple static thresholds.
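
The control-limit idea can be sketched as follows. This simplified version estimates sigma with the sample standard deviation of the historical values; classical individuals control charts usually estimate it from moving ranges instead. The function names and data are illustrative.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Lower and upper control limits computed from historical SLI values."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    center = mean(history)
    spread = stdev(history)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(history, new_points, sigmas=3.0):
    """Return the new points that fall outside the control limits."""
    lower, upper = control_limits(history, sigmas)
    return [x for x in new_points if x < lower or x > upper]


# Hypothetical daily P99 latency values, in seconds.
history = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9, 1.1]
print(out_of_control(history, [1.1, 1.9, 0.95]))
```

Here only the 1.9-second point is flagged; the other two fall within normal variation even though a tight static threshold might have alerted on them.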

Mean Time to Recovery (MTTR)

Mean Time to Recovery is a key reliability metric measuring the average time taken to restore a service to normal operation after a failure or significant degradation is detected.

Components of MTTR: Detection Time + Diagnosis Time + Mitigation Time + Restoration Time.
A primary goal of SRE is to minimize MTTR through effective monitoring (SLIs), automation, and clear runbooks.
While not an SLI itself, MTTR is often tracked as a supporting metric for the overall health of an operations team.
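
As a small illustration of the component breakdown above, MTTR can be computed by summing the four phases per incident and averaging across incidents. The `Incident` record and the durations are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Duration of each recovery phase, in minutes."""
    detection_min: float
    diagnosis_min: float
    mitigation_min: float
    restoration_min: float

    @property
    def recovery_min(self) -> float:
        return (self.detection_min + self.diagnosis_min
                + self.mitigation_min + self.restoration_min)


def mttr(incidents):
    """Mean time to recovery across incidents, in minutes."""
    if not incidents:
        raise ValueError("no incidents recorded")
    return sum(i.recovery_min for i in incidents) / len(incidents)


incidents = [Incident(2, 10, 5, 3), Incident(1, 4, 3, 2)]
print(mttr(incidents))  # 15.0
```

Keeping the phases separate shows which one dominates recovery time, which is where monitoring, automation, or runbook work is likely to pay off.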

Elite target for critical incidents: < 5 min

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
