Inferensys

Glossary

Service Level Indicator (SLI)

A Service Level Indicator (SLI) is a quantitatively measured aspect of an LLM service's performance, such as request latency, throughput, or error rate, that is used to assess compliance with a Service Level Objective (SLO).
LLM PERFORMANCE MONITORING

What is a Service Level Indicator (SLI)?


A Service Level Indicator (SLI) is a specific, measurable metric that quantifies a critical aspect of a service's performance or reliability from the user's perspective. In the context of Large Language Model (LLM) operations, common SLIs include latency percentiles (P95, P99), Time to First Token (TTFT), tokens-per-second throughput, and request success or error rates. These indicators provide the raw, objective data required to define and track service quality.

SLIs are the foundational inputs for Service Level Objectives (SLOs), which are the agreed-upon target values for each indicator. By continuously monitoring SLIs, engineering teams can track their error budget, the amount of unreliability an SLO permits (for example, 0.1% of requests under a 99.9% success target), and make data-driven decisions about deployments, capacity planning, and incident response to maintain reliable LLM-powered applications.


Key Characteristics of an Effective SLI

A well-defined Service Level Indicator (SLI) is the foundation of reliable LLM operations. These characteristics ensure SLIs are measurable, actionable, and directly tied to user experience.

01

Quantitatively Measurable

An effective SLI must be expressed as a numerical quantity derived from observable system data. It cannot be a subjective judgment. For LLMs, this typically involves:

  • Latency metrics: Time to First Token (TTFT), Inter-Token Latency, end-to-end request duration.
  • Throughput metrics: Tokens per Second (TPS), successful requests per minute.
  • Quality metrics: Perplexity scores, hallucination rates (though these require careful measurement).
  • Availability metrics: Uptime percentage, error rate (4xx/5xx responses).

The measurement must be automatable via telemetry systems like OpenTelemetry or Prometheus.
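As a rough illustration of how the latency and throughput metrics listed above reduce to plain numbers, the sketch below derives TTFT, mean inter-token latency, and tokens per second from token arrival timestamps. The `StreamTiming` type and its field names are hypothetical, and defining throughput over the full end-to-end duration is an assumption; some teams measure it over the decode phase only.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Timestamps (in seconds) for one streamed completion. Hypothetical type."""
    request_sent: float        # when the client sent the request
    token_times: list[float]   # arrival time of each generated token


def time_to_first_token(t: StreamTiming) -> float:
    """TTFT: delay between sending the request and receiving the first token."""
    return t.token_times[0] - t.request_sent


def mean_inter_token_latency(t: StreamTiming) -> float:
    """Average gap between consecutive tokens (needs at least two tokens)."""
    gaps = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    return sum(gaps) / len(gaps)


def tokens_per_second(t: StreamTiming) -> float:
    """Tokens generated per second of end-to-end duration (assumed definition)."""
    duration = t.token_times[-1] - t.request_sent
    return len(t.token_times) / duration


timing = StreamTiming(request_sent=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(time_to_first_token(timing))       # 0.5
print(mean_inter_token_latency(timing))  # 0.25
print(tokens_per_second(timing))
```

In production these timestamps would normally come from tracing or metrics instrumentation rather than hand-built lists; the arithmetic stays the same.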
02

Directly User-Centric

The SLI should measure an aspect of the service that directly impacts the end-user's experience or the business outcome. Avoid proxy metrics that are easy to measure but not felt by users.

Good Examples:

  • P99 latency for chat completions (users feel slow responses).
  • Error rate for API requests (users get failed interactions).
  • Token throughput for a summarization feature (users wait for the result).

Poor Examples:

  • GPU utilization percentage (infrastructure concern, not user-facing).
  • Cache hit rate (an internal optimization, not a user outcome).
03

Controllable by the Engineering Team

The performance of the SLI should be primarily influenced by engineering decisions and system changes within the team's purview. If an SLI is affected by external factors the team cannot mitigate, it fails as a useful indicator.

For LLM services, this means:

  • Model serving infrastructure choices (batching, hardware) affect latency (TTFT, TPS).
  • Application code and prompt engineering affect error rates and output validity.
  • System architecture (caching, load balancing) affects availability.

An SLI like "third-party API latency" is not controllable if the dependency is external.

04

Aligned with a Service Level Objective (SLO)

An SLI is meaningless without a target. It must have a corresponding Service Level Objective (SLO)—a target value or range that defines acceptable performance. The SLO provides the context for whether the SLI's current value is good or bad.

Example Pairing:

  • SLI: End-to-end latency for a text generation endpoint.
  • SLO: 95% of requests complete within 2 seconds over a 28-day window.

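The SLO sets a target on the SLI's measured value and thereby defines an error budget (here, up to 5% of requests may take longer than 2 seconds in each 28-day window) that guides deployment velocity and the prioritization of reliability work.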
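The example pairing can be checked with very little code. The sketch below is a minimal illustration, assuming request durations for the window are already available as a list of seconds; the function names `latency_sli` and `meets_slo` are made up for this example.

```python
def latency_sli(durations_s: list[float], threshold_s: float) -> float:
    """SLI: fraction of requests that completed within the threshold."""
    if not durations_s:
        raise ValueError("no requests in the window")
    good = sum(1 for d in durations_s if d <= threshold_s)
    return good / len(durations_s)


def meets_slo(sli: float, target: float) -> bool:
    """SLO check: the measured SLI must reach the target ratio."""
    return sli >= target


# Ten sample request durations (seconds); two of them exceed 2 seconds.
window = [0.8, 1.2, 1.9, 0.6, 2.4, 1.1, 0.9, 1.5, 1.7, 3.1]
sli = latency_sli(window, threshold_s=2.0)
print(f"SLI = {sli:.0%}, SLO met: {meets_slo(sli, target=0.95)}")
# SLI = 80%, SLO met: False
```

Keeping the measurement (`latency_sli`) separate from the target (`meets_slo`) mirrors the SLI/SLO split: the same indicator can be judged against a different target without re-instrumenting anything.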

05

Consistently Measured Over a Defined Window

SLI measurement must be consistent and comparable over time. This requires:

  • A stable collection methodology (e.g., always measured at the API gateway).
  • A defined aggregation window (e.g., rolling 28 days, daily) for assessment against SLOs.
  • Clear aggregation rules (e.g., is the SLI a percentile, a mean, a ratio?).

For LLM latency, this often means tracking latency percentiles (P50, P90, P99) over a rolling window to understand both typical and tail performance. Inconsistency in measurement invalidates trend analysis and SLO compliance tracking.
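One way to keep percentile tracking consistent is to fix both the aggregation rule and the window in code. The sketch below assumes the nearest-rank percentile definition and an in-memory window; `RollingLatencyWindow` is a hypothetical helper, and a real deployment would more likely rely on histograms in a metrics backend.

```python
import math
from collections import deque


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


class RollingLatencyWindow:
    """Keeps (timestamp, latency) samples from the last `window_s` seconds."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self.samples = deque()

    def record(self, timestamp: float, latency_ms: float) -> None:
        self.samples.append((timestamp, latency_ms))
        # Evict samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_s:
            self.samples.popleft()

    def summary(self) -> dict[str, float]:
        latencies = [latency for _, latency in self.samples]
        return {f"p{p}": percentile(latencies, p) for p in (50, 90, 99)}


window = RollingLatencyWindow(window_s=28 * 24 * 3600)  # rolling 28 days
for i in range(1, 101):
    window.record(timestamp=float(i), latency_ms=float(10 * i))
print(window.summary())  # {'p50': 500.0, 'p90': 900.0, 'p99': 990.0}
```

Other percentile definitions (for example, linear interpolation) give slightly different values on the same data, which is why the aggregation rule should be chosen once and kept stable.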

06

Simple and Few in Number

A service should have a small, focused set of SLIs (typically 2-5) that capture its core reliability promises. Too many SLIs create noise and dilute focus. The goal is to identify the vital few metrics that truly indicate service health.

For a core LLM inference API, essential SLIs might be:

  1. Availability: Successful request ratio.
  2. Latency: P99 request duration.
  3. Throughput: Sustained Tokens per Second (for cost/performance).

Additional quality SLIs (e.g., for hallucination rate) may be added for specific, high-stakes use cases but should not overwhelm the core set.

SERVICE LEVEL INDICATORS

Common SLI Examples for LLM Services

Quantitatively measured aspects of an LLM service's performance, used to assess compliance with Service Level Objectives (SLOs).

Latency & Responsiveness

  • Primary Metric: Time to First Token (TTFT) P99 < 2 sec
  • Supporting Metric: Inter-Token Latency P95 < 100 ms
  • Measurement Method: Distributed tracing from client request to first token streamed.
  • Typical SLO Target: P99 TTFT < 1.5 sec for 28-day rolling window.
  • Key Influencing Factors: Input prompt length, model size, GPU compute, network latency, prefill stage.
  • Monitoring Tools: OpenTelemetry traces, Prometheus histograms, Grafana dashboards.
  • Associated Risk: Poor user experience for streaming applications.
  • Mitigation Strategy: Inference optimization (quantization), caching frequent prompts, scaling compute.

Throughput & Scalability

  • Primary Metric: Sustained Tokens per Second (TPS) > 100
  • Supporting Metric: Peak Concurrent Requests > 1000
  • Measurement Method: Aggregate token count over time from model serving layer.
  • Typical SLO Target: Sustained TPS > 150 for 95% of 5-minute intervals.
  • Key Influencing Factors: Batch size, KV cache efficiency, continuous batching, hardware acceleration.
  • Monitoring Tools: Custom exporters to Prometheus, vendor-specific metrics APIs.
  • Associated Risk: Inability to handle traffic spikes, queue buildup.
  • Mitigation Strategy: Implement continuous batching, scale horizontally, optimize KV cache usage.

Quality & Correctness

  • Primary Metric: Hallucination Rate < 3%
  • Supporting Metric: Task-Specific Accuracy Score > 0.95
  • Measurement Method: Comparison of generated output against a golden dataset or human evaluation.
  • Typical SLO Target: Hallucination rate remains within 2 percentage points of baseline.
  • Key Influencing Factors: Prompt clarity, model temperature, context relevance, retrieval-augmented generation (RAG) grounding.
  • Monitoring Tools: Automated evaluation pipelines, human-in-the-loop (HITL) review platforms.
  • Associated Risk: Loss of user trust, generation of incorrect or harmful content.
  • Mitigation Strategy: Implement output validation, use RAG, fine-tune on domain data, adjust sampling parameters.

Reliability & Availability

  • Primary Metric: Request Success Rate > 99.9%
  • Supporting Metric: Model/Endpoint Uptime > 99.95%
  • Measurement Method: Count of successful HTTP 200 responses vs. 4xx/5xx errors and timeouts.
  • Typical SLO Target: Error budget consumption < 10% per month.
  • Key Influencing Factors: Infrastructure health, dependency failures (e.g., vector database), quota limits, model loading time.
  • Monitoring Tools: Synthetic probes, health check endpoints, load balancer metrics, application logs.
  • Associated Risk: Service outages, violation of contractual agreements.
  • Mitigation Strategy: Implement graceful degradation, canary deployments, redundant endpoints, automated failover.
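The throughput target listed above ("Sustained TPS > 150 for 95% of 5-minute intervals") is an interval-based objective rather than a per-request one. The sketch below shows one possible way to evaluate it, assuming the serving layer emits `(timestamp_s, tokens_generated)` events; the event shape and function names are hypothetical.

```python
from collections import defaultdict

INTERVAL_S = 300  # 5-minute buckets


def tps_per_interval(events: list[tuple[float, int]]) -> dict[int, float]:
    """Average tokens per second within each 5-minute bucket."""
    tokens = defaultdict(int)
    for timestamp_s, tokens_generated in events:
        tokens[int(timestamp_s // INTERVAL_S)] += tokens_generated
    return {bucket: total / INTERVAL_S for bucket, total in tokens.items()}


def fraction_of_intervals_above(tps_by_bucket: dict[int, float], min_tps: float) -> float:
    """Share of observed intervals whose average TPS exceeded min_tps."""
    good = sum(1 for tps in tps_by_bucket.values() if tps > min_tps)
    return good / len(tps_by_bucket)


events = [(10.0, 60_000), (400.0, 30_000), (700.0, 48_000), (1000.0, 51_000)]
tps = tps_per_interval(events)
print(tps)  # {0: 200.0, 1: 100.0, 2: 160.0, 3: 170.0}
ratio = fraction_of_intervals_above(tps, min_tps=150)
print(ratio, ratio >= 0.95)  # 0.75 False
```

Note that this sketch only counts intervals that contain at least one event. Intervals with no traffic are skipped rather than counted as zero throughput, which may or may not be the desired behavior for a given service.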


How SLIs Relate to SLOs and Error Budgets

In LLM operations, Service Level Indicators (SLIs) are the foundational metrics that quantify performance, which are then formalized into targets via Service Level Objectives (SLOs) to create actionable error budgets for managing reliability.

A Service Level Indicator (SLI) is a directly measured quantitative metric of a service's performance or reliability, such as request latency, throughput (Tokens per Second), or error rate. For LLMs, common SLIs include Time to First Token (TTFT) and inter-token latency, which define the user-perceived responsiveness of a generative AI service. These raw measurements provide the empirical data needed to assess system health.

A Service Level Objective (SLO) is a target value or range for an SLI, defining the acceptable level of service, such as "P99 latency < 2 seconds." The SLO implies an error budget: the amount of unreliability it permits over a period like a month (for a 99.9% success target, 0.1% of requests). Comparing the actual SLI measurement against the SLO shows how much of that budget has been consumed, which guides engineering decisions on the risk and pace of new deployments, model updates, or infrastructure changes.
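For a ratio-style SLI such as request success rate, one common way to make the error budget concrete is to treat it as (1 − SLO target) × total requests and track how many failures have used it up. The sketch below follows that convention; the function name and the returned keys are illustrative assumptions, not a standard API.

```python
def error_budget(slo_target: float, total_requests: int, bad_requests: int) -> dict[str, float]:
    """Summarize error-budget usage for a ratio SLI such as request success rate."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("need at least one request in the window")
    allowed_bad = (1 - slo_target) * total_requests  # the budget, in requests
    return {
        "sli": 1 - bad_requests / total_requests,     # measured success ratio
        "allowed_bad": allowed_bad,
        "consumed": bad_requests / allowed_bad,       # 1.0 means fully spent
        "remaining_bad": allowed_bad - bad_requests,
    }


report = error_budget(slo_target=0.999, total_requests=1_000_000, bad_requests=250)
print(f"SLI {report['sli']:.4%}, budget used {report['consumed']:.0%}, "
      f"{report['remaining_bad']:.0f} failures left")
# SLI 99.9750%, budget used 25%, 750 failures left
```

A team might, for instance, slow down risky rollouts once `consumed` approaches 1.0 and resume them when the rolling window recovers.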

SERVICE LEVEL INDICATOR

Frequently Asked Questions


A Service Level Indicator (SLI) is a directly measurable metric that quantifies a specific aspect of a service's performance or reliability. In Large Language Model (LLM) operations, an SLI is measured continuously from live production traffic and compared against a predefined target, the Service Level Objective (SLO), to determine whether the service is meeting its reliability goals.

How it works:

  1. Definition: Engineers select a critical user-facing aspect of the service, such as the latency of chat completions or the success rate of tool-calling requests.
  2. Measurement: Instrumentation (e.g., using OpenTelemetry) is added to the application code to record this metric for every request.
  3. Aggregation: Raw measurements are aggregated over a defined time window (e.g., a 28-day rolling window) into a single percentage or value (e.g., 99.9% of requests had latency under 2 seconds).
  4. Comparison: This aggregated value is compared to the SLO target. How far the measured SLI sits above or below the target determines the remaining error budget, which guides operational decisions.
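The four steps above can be sketched end to end in a few lines. This is a simplified, in-process illustration rather than a production setup: `SliRecorder` is a hypothetical class, the lambdas stand in for real model calls, and a real service would typically export these measurements through OpenTelemetry or a similar system instead of keeping them in a list.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SliRecorder:
    """Step 2 (measurement): records latency and outcome for each request."""
    clock: Callable[[], float] = time.perf_counter
    samples: list[tuple[float, bool]] = field(default_factory=list)

    def observe(self, handler: Callable[[], object]) -> object:
        start = self.clock()
        ok = False
        try:
            result = handler()
            ok = True
            return result
        finally:
            # Runs on success and on failure, so every request is counted.
            self.samples.append((self.clock() - start, ok))

    def good_ratio(self, threshold_s: float) -> float:
        """Step 3 (aggregation): share of requests that succeeded within the threshold."""
        good = sum(1 for latency, ok in self.samples if ok and latency <= threshold_s)
        return good / len(self.samples)


# Step 1 (definition): a request is "good" if it succeeds within 2 seconds.
recorder = SliRecorder()
recorder.observe(lambda: "hello")        # stand-in for a successful model call
try:
    recorder.observe(lambda: 1 / 0)      # a failing request is still recorded
except ZeroDivisionError:
    pass

sli = recorder.good_ratio(threshold_s=2.0)
print(sli >= 0.999)                      # step 4 (comparison) against the SLO target
```

Passing the clock in as a parameter is a small design choice that keeps the recorder testable: a fake clock can simulate slow requests without actually waiting.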

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.