A Service Level Indicator (SLI) is a specific, measurable metric that quantifies a critical aspect of a service's performance or reliability from the user's perspective. In the context of Large Language Model (LLM) operations, common SLIs include latency percentiles (P95, P99), Time to First Token (TTFT), tokens-per-second throughput, and request success or error rates. These indicators provide the raw, objective data required to define and track service quality.
Service Level Indicator (SLI)

What is a Service Level Indicator (SLI)?
A Service Level Indicator (SLI) is a quantitatively measured aspect of an LLM service's performance, such as request latency, throughput, or error rate, that is used to assess compliance with a Service Level Objective.
SLIs are the foundational inputs for Service Level Objectives (SLOs), which are the agreed-upon target values for each indicator. By continuously monitoring SLIs, engineering teams can track how much of their error budget (the amount of unreliability an SLO permits over its measurement window) has been spent, and make data-driven decisions about deployments, capacity planning, and incident response to maintain reliable LLM-powered applications.
Key Characteristics of an Effective SLI
A well-defined Service Level Indicator (SLI) is the foundation of reliable LLM operations. These characteristics ensure SLIs are measurable, actionable, and directly tied to user experience.
Quantitatively Measurable
An effective SLI must be expressed as a numerical quantity derived from observable system data. It cannot be a subjective judgment. For LLMs, this typically involves:
- Latency metrics: Time to First Token (TTFT), Inter-Token Latency, end-to-end request duration.
- Throughput metrics: Tokens per Second (TPS), successful requests per minute.
- Quality metrics: Perplexity scores, hallucination rates (though these require careful measurement).
- Availability metrics: Uptime percentage, error rate (4xx/5xx responses).
The measurement must be automatable via telemetry systems like OpenTelemetry or Prometheus.
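As a rough illustration of how the latency and throughput metrics above could be derived from raw telemetry, the sketch below computes TTFT, mean inter-token latency, and tokens per second from per-token arrival timestamps. The `StreamedResponse` type and its field names are hypothetical and are not part of OpenTelemetry or Prometheus; in practice the timestamps would come from whatever tracing or logging the service already emits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamedResponse:
    """Timestamps (in seconds) for one streamed request. Hypothetical shape."""
    request_sent: float
    token_times: List[float]  # arrival time of each generated token


def time_to_first_token(r: StreamedResponse) -> float:
    # TTFT: delay between sending the request and receiving the first token.
    return r.token_times[0] - r.request_sent


def mean_inter_token_latency(r: StreamedResponse) -> float:
    # Average gap between consecutive tokens after the first one.
    gaps = [b - a for a, b in zip(r.token_times, r.token_times[1:])]
    return sum(gaps) / len(gaps)


def tokens_per_second(r: StreamedResponse) -> float:
    # Output tokens divided by total wall-clock time for the request.
    return len(r.token_times) / (r.token_times[-1] - r.request_sent)


resp = StreamedResponse(request_sent=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print("TTFT (s):", time_to_first_token(resp))
print("Mean inter-token latency (s):", mean_inter_token_latency(resp))
print("Tokens per second:", round(tokens_per_second(resp), 2))
```

Each function returns a plain number per request, so the values can be fed into a histogram or counter and aggregated later.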
Directly User-Centric
The SLI should measure an aspect of the service that directly impacts the end-user's experience or the business outcome. Avoid proxy metrics that are easy to measure but not felt by users.
Good Examples:
- P99 latency for chat completions (users feel slow responses).
- Error rate for API requests (users get failed interactions).
- Token throughput for a summarization feature (users wait for the result).
Poor Examples:
- GPU utilization percentage (infrastructure concern, not user-facing).
- Cache hit rate (an internal optimization, not a user outcome).
Controllable by the Engineering Team
The performance of the SLI should be primarily influenced by engineering decisions and system changes within the team's purview. If an SLI is affected by external factors the team cannot mitigate, it fails as a useful indicator.
For LLM services, this means:
- Model serving infrastructure choices (batching, hardware) affect latency and throughput (TTFT, TPS).
- Application code and prompt engineering affect error rates and output validity.
- System architecture (caching, load balancing) affects availability.
An SLI like "third-party API latency" is not controllable if the dependency is external.
Aligned with a Service Level Objective (SLO)
An SLI is meaningless without a target. It must have a corresponding Service Level Objective (SLO)—a target value or range that defines acceptable performance. The SLO provides the context for whether the SLI's current value is good or bad.
Example Pairing:
- SLI: End-to-end latency for a text generation endpoint.
- SLO: 95% of requests complete within 2 seconds over a 28-day window.
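The SLO sets a target on the SLI's measured value, and the gap between that target and 100% is the error budget that guides deployment velocity and prioritization of reliability work. In the pairing above, up to 5% of requests in the window may exceed 2 seconds.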
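The pairing above can be sketched in a few lines. This is a minimal example, assuming request durations for the window are already available as a list of seconds; the function names and the sample data are illustrative only.

```python
def latency_sli(durations_s, threshold_s=2.0):
    """SLI: fraction of requests that completed within the threshold."""
    if not durations_s:
        raise ValueError("no requests in the window")
    good = sum(1 for d in durations_s if d <= threshold_s)
    return good / len(durations_s)


def meets_slo(sli_value, slo_target=0.95):
    """SLO check: is the measured SLI at or above the target?"""
    return sli_value >= slo_target


# Illustrative sample: one of ten requests took longer than 2 seconds.
durations = [0.8, 1.2, 1.9, 2.4, 0.6, 1.1, 1.5, 0.9, 1.7, 1.3]
sli = latency_sli(durations)
print(f"SLI = {sli:.0%}, SLO met: {meets_slo(sli)}")
```

With this sample, 90% of requests finish within 2 seconds, which falls short of the 95% target.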
Consistently Measured Over a Defined Window
SLI measurement must be consistent and comparable over time. This requires:
- A stable collection methodology (e.g., always measured at the API gateway).
- A defined aggregation window (e.g., rolling 28 days, daily) for assessment against SLOs.
- Clear aggregation rules (e.g., is the SLI a percentile, a mean, a ratio?).
For LLM latency, this often means tracking latency percentiles (P50, P90, P99) over a rolling window to understand both typical and tail performance. Inconsistency in measurement invalidates trend analysis and SLO compliance tracking.
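One possible way to track percentiles over a rolling window is sketched below. It is a simplified in-memory version that uses the nearest-rank percentile definition; the `RollingLatencyWindow` class is hypothetical, and the 100-second window is only there to keep the demo small (a production SLO window such as 28 days would normally be computed from stored histograms rather than raw samples).

```python
import math
from collections import deque


class RollingLatencyWindow:
    """Keeps latency samples from the last `window_s` seconds (illustrative)."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, latency_ms), oldest first

    def record(self, timestamp_s, latency_ms):
        self.samples.append((timestamp_s, latency_ms))
        # Drop samples that have aged out of the window.
        cutoff = timestamp_s - self.window_s
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def percentile(self, p):
        """Nearest-rank percentile of the latencies currently in the window."""
        values = sorted(latency for _, latency in self.samples)
        if not values:
            raise ValueError("window is empty")
        rank = max(1, math.ceil(p * len(values) / 100))
        return values[rank - 1]


window = RollingLatencyWindow(window_s=100)
for t in range(1, 201):
    # Synthetic data: latency grows by 10 ms each second.
    window.record(timestamp_s=t, latency_ms=t * 10)

p50, p90, p99 = (window.percentile(p) for p in (50, 90, 99))
print(f"P50={p50} ms, P90={p90} ms, P99={p99} ms")
```

Fixing the percentile definition and the window length in code is one way to keep the aggregation rules consistent from one report to the next.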
Simple and Few in Number
A service should have a small, focused set of SLIs (typically 2-5) that capture its core reliability promises. Too many SLIs create noise and dilute focus. The goal is to identify the vital few metrics that truly indicate service health.
For a core LLM inference API, essential SLIs might be:
- Availability: Successful request ratio.
- Latency: P99 request duration.
- Throughput: Sustained Tokens per Second (for cost/performance).
Additional quality SLIs (e.g., for hallucination rate) may be added for specific, high-stakes use cases but should not overwhelm the core set.
Common SLI Examples for LLM Services
Quantitatively measured aspects of an LLM service's performance, used to assess compliance with Service Level Objectives (SLOs).
| SLI Category | Latency & Responsiveness | Throughput & Scalability | Quality & Correctness | Reliability & Availability |
|---|---|---|---|---|
| Primary Metric | Time to First Token (TTFT) P99 < 2 sec | Sustained Tokens per Second (TPS) > 100 | Hallucination Rate < 3% | Request Success Rate > 99.9% |
| Supporting Metric | Inter-Token Latency P95 < 100 ms | Peak Concurrent Requests > 1000 | Task-Specific Accuracy Score > 0.95 | Model/Endpoint Uptime > 99.95% |
| Measurement Method | Distributed tracing from client request to first token streamed. | Aggregate token count over time from model serving layer. | Comparison of generated output against a golden dataset or human evaluation. | Count of successful HTTP 200 responses vs. 4xx/5xx errors and timeouts. |
| Typical SLO Target | P99 TTFT < 1.5 sec for 28-day rolling window. | Sustained TPS > 150 for 95% of 5-minute intervals. | Hallucination rate remains within 2 percentage points of baseline. | Error budget consumption < 10% per month. |
| Key Influencing Factors | Input prompt length, model size, GPU compute, network latency, prefill stage. | Batch size, KV cache efficiency, continuous batching, hardware acceleration. | Prompt clarity, model temperature, context relevance, retrieval-augmented generation (RAG) grounding. | Infrastructure health, dependency failures (e.g., vector database), quota limits, model loading time. |
| Monitoring Tools | OpenTelemetry traces, Prometheus histograms, Grafana dashboards. | Custom exporters to Prometheus, vendor-specific metrics APIs. | Automated evaluation pipelines, human-in-the-loop (HITL) review platforms. | Synthetic probes, health check endpoints, load balancer metrics, application logs. |
| Associated Risk | Poor user experience for streaming applications. | Inability to handle traffic spikes, queue buildup. | Loss of user trust, generation of incorrect or harmful content. | Service outages, violation of contractual agreements. |
| Mitigation Strategy | Inference optimization (quantization), caching frequent prompts, scaling compute. | Implement continuous batching, scale horizontally, optimize KV cache usage. | Implement output validation, use RAG, fine-tune on domain data, adjust sampling parameters. | Implement graceful degradation, canary deployments, redundant endpoints, automated failover. |
How SLIs Relate to SLOs and Error Budgets
In LLM operations, Service Level Indicators (SLIs) are the foundational metrics that quantify performance, which are then formalized into targets via Service Level Objectives (SLOs) to create actionable error budgets for managing reliability.
A Service Level Indicator (SLI) is a directly measured quantitative metric of a service's performance or reliability, such as request latency, throughput (Tokens per Second), or error rate. For LLMs, common SLIs include Time to First Token (TTFT) and inter-token latency, which define the user-perceived responsiveness of a generative AI service. These raw measurements provide the empirical data needed to assess system health.
A Service Level Objective (SLO) is a target value or range for an SLI, defining the acceptable level of service, such as "P99 latency < 2 seconds." The SLO implies an error budget: the share of requests or time, over a period like a month, that is allowed to miss the target (1 - SLO for ratio-based objectives). Comparing the measured SLI against the SLO shows how much of that budget has been spent. The budget quantifies the allowable unreliability, guiding engineering decisions on the risk and pace of new deployments, model updates, or infrastructure changes.
Frequently Asked Questions
A Service Level Indicator (SLI) is a quantitative, directly measurable metric of a specific aspect of a service's performance or reliability. In Large Language Model (LLM) operations, an SLI is measured continuously from live production traffic and compared against a predefined target, the Service Level Objective (SLO), to determine whether the service is meeting its reliability goals.
How it works:
- Definition: Engineers select a critical user-facing aspect of the service, such as the latency of chat completions or the success rate of tool-calling requests.
- Measurement: Instrumentation (e.g., using OpenTelemetry) is added to the application code to record this metric for every request.
- Aggregation: Raw measurements are aggregated over a defined time window (e.g., a 28-day rolling window) into a single percentage or value (e.g., 99.9% of requests had latency under 2 seconds).
- Comparison: This aggregated value is compared to the SLO target. The SLO fixes the total error budget, and the gap between the measured SLI and the SLO target shows how much of that budget remains, guiding operational decisions.
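The four steps above can be sketched end to end as follows. This is a simplified illustration, not a reference implementation: `RequestRecord`, `is_good`, and `evaluate` are hypothetical names, the records are synthetic, and a request is assumed to be "good" only if it returns a 2xx status within 2 seconds.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One measured request (step 2). Field names are illustrative."""
    timestamp_s: float
    status_code: int
    latency_s: float


def is_good(r, latency_threshold_s=2.0):
    # Step 1 (definition): a good request succeeds and is fast enough.
    return 200 <= r.status_code < 300 and r.latency_s <= latency_threshold_s


def evaluate(records, now_s, window_s, slo_target):
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Step 3 (aggregation): keep only requests inside the window.
    in_window = [r for r in records if now_s - window_s < r.timestamp_s <= now_s]
    if not in_window:
        raise ValueError("no requests in the window")
    total = len(in_window)
    bad = sum(1 for r in in_window if not is_good(r))
    sli = (total - bad) / total
    # Step 4 (comparison): the SLO allows (1 - SLO) * total bad requests.
    allowed_bad = (1 - slo_target) * total
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "budget_remaining": 1 - bad / allowed_bad,
    }


# Synthetic traffic: ten stale failures outside the window, then 1000 requests
# of which four return 503 and one is slow.
records = [RequestRecord(-5.0, 500, 0.4) for _ in range(10)]
for i in range(1, 1001):
    status = 503 if i % 250 == 0 else 200
    latency = 3.5 if i == 100 else 0.8
    records.append(RequestRecord(float(i), status, latency))

result = evaluate(records, now_s=1000.0, window_s=1000.0, slo_target=0.99)
print(result)
```

In this synthetic run, 5 of 1000 in-window requests are bad against an allowance of about 10, so roughly half of the error budget remains. A negative `budget_remaining` would indicate the budget is overspent.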
Related Terms
A Service Level Indicator (SLI) is a core component of a measurable reliability framework. It is defined in relation to other key operational concepts.
Error Budget
An Error Budget quantifies the allowable unreliability for a service over a specific period, derived directly from its SLO. It is calculated as 1 - SLO.
- Example: A 99.9% monthly availability SLO permits an error budget of 0.1% downtime, or approximately 43.2 minutes in a 30-day month.
- This budget governs the pace of innovation and risk-taking: both planned releases that cause errors and unplanned incidents draw from the same budget, so whatever incidents consume is no longer available for risky changes.
- Exhausting the budget typically triggers a blameless postmortem and a freeze on new feature deployments.
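The arithmetic behind the example above is short enough to check directly. The sketch below assumes a 30-day window and a time-based availability SLO; the function name and the 30-minute incident are illustrative.

```python
def allowed_downtime_minutes(slo_target, window_days=30):
    """Error budget expressed as minutes of downtime: (1 - SLO) * window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9995):
    minutes = allowed_downtime_minutes(target)
    print(f"SLO {target:.2%}: about {minutes:.1f} minutes per 30 days")

# Budget left after a hypothetical 30-minute incident under a 99.9% SLO.
budget = allowed_downtime_minutes(0.999)
remaining = budget - 30
print(f"Remaining after incident: about {remaining:.1f} minutes")
```

Under these assumptions, a single 30-minute outage uses roughly 70% of the monthly budget for a 99.9% target.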
Service Level Agreement (SLA)
A Service Level Agreement is a formal contract between a service provider and its customers that includes one or more SLOs, along with consequences (e.g., financial penalties) for failing to meet them.
- Key Distinction: An SLA is an external, commercial contract. An SLO is an internal, engineering target. An SLI is the measured metric.
- SLAs are typically less aggressive than internal SLOs to provide a safety buffer.
- For internal LLM platforms, the "customer" may be another engineering team, governed by an Internal SLA.
Golden Dataset
A Golden Dataset is a curated, high-quality, and versioned set of input-output pairs used as a reference standard for evaluating LLM performance and detecting regressions.
- It serves as the ground truth for monitoring Output Drift and Concept Drift.
- Used in canary and shadow deployments to compare new model versions against a baseline.
- Should be representative of critical production traffic and include edge cases. Maintaining its relevance over time is a key challenge.
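A minimal sketch of a golden-dataset regression check is shown below. It assumes exact-match scoring after light normalization, which suits only short, closed-form answers; free-form output would need a different scorer. The dataset, the two stand-in "models" (plain dictionary lookups), and the 2-percentage-point tolerance are all illustrative.

```python
def normalize(text):
    # Lowercase and collapse whitespace so trivial differences do not count.
    return " ".join(text.lower().split())


def exact_match_rate(golden, generate):
    """Share of golden prompts whose generated output matches the reference."""
    hits = sum(
        1 for prompt, expected in golden
        if normalize(generate(prompt)) == normalize(expected)
    )
    return hits / len(golden)


def is_regression(baseline_score, candidate_score, tolerance=0.02):
    # Flag the candidate if it scores more than `tolerance` below the baseline.
    return candidate_score < baseline_score - tolerance


golden = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
    ("Chemical formula of water?", "H2O"),
]

# Stand-ins for real model calls: the candidate gets one answer wrong.
baseline_answers = dict(golden)
candidate_answers = dict(golden)
candidate_answers["Capital of France?"] = "Lyon"

baseline_score = exact_match_rate(golden, baseline_answers.get)
candidate_score = exact_match_rate(golden, candidate_answers.get)
print(baseline_score, candidate_score, is_regression(baseline_score, candidate_score))
```

In a canary or shadow deployment, the same comparison would run against live model endpoints rather than lookups.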
Statistical Process Control (SPC)
Statistical Process Control is a method of quality control that uses statistical tools, primarily control charts, to monitor a process and detect abnormal variation.
- Applied to SLIs to distinguish between common-cause variation (normal noise) and special-cause variation (indicative of a problem).
- Control limits (e.g., 3-sigma) are calculated from historical SLI data. Data points outside these limits are treated as anomalies and trigger alerts.
- This provides a more robust alerting mechanism than simple static thresholds.
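As a simplified illustration of control limits applied to an SLI, the sketch below derives 3-sigma limits from a baseline of daily P99 latencies and flags new points that fall outside them. It estimates sigma with the sample standard deviation, which is a simplification; classical individuals charts often estimate it from moving ranges instead. The data and function names are made up for the example.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits from historical SLI values."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(points, lower, upper):
    """Return (index, value) for each point outside the control limits."""
    return [(i, x) for i, x in enumerate(points) if x < lower or x > upper]


# Hypothetical daily P99 latency in seconds over a stable period.
baseline = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0]
lower, upper = control_limits(baseline)

# New observations: the third one is a clear outlier.
new_points = [1.05, 0.95, 1.6, 1.1]
flagged = out_of_control(new_points, lower, upper)
print(f"limits=({lower:.3f}, {upper:.3f}), flagged={flagged}")
```

Points inside the limits are treated as common-cause variation, so only the 1.6-second observation would raise an alert here.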
Mean Time to Recovery (MTTR)
Mean Time to Recovery is a key reliability metric measuring the average time taken to restore a service to normal operation after a failure or significant degradation is detected.
- Components of MTTR: Detection Time + Diagnosis Time + Mitigation Time + Restoration Time.
- A primary goal of SRE is to minimize MTTR through effective monitoring (SLIs), automation, and clear runbooks.
- While not an SLI itself, MTTR is often tracked as a supporting metric for the overall health of an operations team.
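The component breakdown above translates directly into a small calculation. The sketch below assumes each incident record already carries the four phase durations in minutes; the `Incident` type and the sample incidents are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Phase durations for one incident, in minutes (illustrative fields)."""
    detection_min: float
    diagnosis_min: float
    mitigation_min: float
    restoration_min: float

    @property
    def time_to_recovery_min(self) -> float:
        # Detection + Diagnosis + Mitigation + Restoration.
        return (
            self.detection_min
            + self.diagnosis_min
            + self.mitigation_min
            + self.restoration_min
        )


def mttr(incidents: List[Incident]) -> float:
    """Mean time to recovery across incidents, in minutes."""
    if not incidents:
        raise ValueError("no incidents to average")
    return sum(i.time_to_recovery_min for i in incidents) / len(incidents)


incidents = [
    Incident(5, 20, 10, 5),   # 40 minutes total
    Incident(2, 8, 15, 5),    # 30 minutes total
    Incident(10, 30, 5, 5),   # 50 minutes total
]
print("MTTR (min):", mttr(incidents))
```

Keeping the phases separate makes it easier to see whether detection, diagnosis, or mitigation dominates recovery time.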

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.