Inferensys

Service Level Agreement (SLA)

A Service Level Agreement (SLA) is a formal contract between a service provider and a customer that defines minimum performance levels, often with financial penalties for unmet Service Level Objectives (SLOs).
SLO/SLI DEFINITION FOR AI

What is a Service Level Agreement (SLA)?

A formal contract defining the minimum acceptable performance and availability of a service, with explicit consequences for non-compliance.

A Service Level Agreement (SLA) is a formal contract between a service provider and a customer that defines the minimum level of service expected, including specific Service Level Objectives (SLOs) and the financial penalties or service credits applied if those objectives are not met. In AI services, SLAs codify commitments for critical metrics like model inference latency, throughput, and quality targets such as hallucination rate, creating a binding framework for reliability.

For AI systems, an SLA operationalizes the error budget derived from SLOs, explicitly defining the remediation process when the budget is exhausted. It moves beyond internal engineering targets to establish legal and business accountability, covering aspects like Mean Time To Recovery (MTTR) and support responsiveness. This contract is essential for enterprise adoption, providing the certainty required to integrate AI-powered capabilities into core business processes.

Key Components of an AI Service Level Agreement

An AI Service Level Agreement (SLA) is a formal contract that defines the minimum expected performance, availability, and quality of an AI-powered service. For AI systems, SLAs must also account for failure modes that traditional services do not have, such as hallucinations from non-deterministic model output and quality loss from data drift.

01

Service Level Objectives (SLOs)

An SLO is a quantitative target for service reliability or quality, expressed as a percentage over a time window. For AI services, SLOs extend beyond traditional uptime to include:

  • Model Quality SLOs: e.g., "In 99% of daily evaluation batches, the hallucination rate must be below 5%."
  • Performance SLOs: e.g., "p95 inference latency must be under 500ms."
  • Business SLOs: e.g., "Agent task success rate must exceed 95%."

SLOs are derived from Service Level Indicators (SLIs) and define the threshold for acceptable service.
02

Service Level Indicators (SLIs)

An SLI is a directly measurable metric quantifying a specific aspect of service performance. AI-specific SLIs are critical for evaluating SLOs:

  • Inference Latency: Time from request to model output (often split into Time To First Token (TTFT) and Time Per Output Token (TPOT)).
  • Error Rate: Percentage of requests resulting in a model error or crash.
  • Quality Metrics: Hallucination rate, answer faithfulness score, or Retrieval Precision@K for RAG systems.
  • Throughput: Queries processed per second (QPS), often raised with serving techniques like continuous batching.
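
As a rough illustration of how the SLIs above might be computed from request logs, the following Python sketch derives error rate, p95 TTFT, mean TPOT, and throughput. The `RequestRecord` schema and the nearest-rank percentile method are assumptions made for this example, not something a particular monitoring stack prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One logged inference request (hypothetical schema for illustration)."""
    start_s: float         # time the request was received
    first_token_s: float   # time the first output token was emitted
    end_s: float           # time the final output token was emitted
    output_tokens: int     # number of tokens generated
    ok: bool               # False if the request ended in an error


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(records):
    """Return a dict of SLIs. Assumes a non-empty log spanning more than zero seconds."""
    ok = [r for r in records if r.ok]
    ttft = [r.first_token_s - r.start_s for r in ok]
    # TPOT covers the tokens after the first one, so it needs at least two tokens.
    tpot = [
        (r.end_s - r.first_token_s) / (r.output_tokens - 1)
        for r in ok
        if r.output_tokens > 1
    ]
    span_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    return {
        "error_rate": 1 - len(ok) / len(records),
        "p95_ttft_s": percentile(ttft, 95),
        "mean_tpot_s": sum(tpot) / len(tpot),
        "throughput_qps": len(records) / span_s,
    }
```

Latency SLIs here are computed over successful requests only, while failures count toward the error rate. An SLA's measurement methodology would need to state that choice explicitly.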
03

Error Budgets & Remediation

An error budget is the allowable amount of unreliability, calculated as 100% - SLO. It quantifies risk tolerance.

  • Purpose: Defines how many errors or SLO violations are acceptable before financial or operational penalties apply.
  • Burn Rate: The speed at which the error budget is consumed. A high burn rate triggers alerts.
  • Remediation Actions: Specifies steps if the budget is exhausted, which may include service credits, automatic rollbacks via canary deployment analysis, or dedicated engineering time for remediation.
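
The error budget and burn rate arithmetic can be sketched in a few lines of Python. The function names and the example figures are illustrative assumptions. The only relationship taken from the text is that the budget equals 100% minus the SLO.

```python
def error_budget(slo: float) -> float:
    """Allowed fraction of bad events, e.g. an SLO of 0.995 leaves a budget of 0.005."""
    return 1.0 - slo


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error rate divided by the budgeted error rate.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window; higher values mean it runs out sooner.
    """
    return (bad / total) / error_budget(slo)


def budget_remaining(bad_so_far: int, expected_total: int, slo: float) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_bad = error_budget(slo) * expected_total
    return 1.0 - bad_so_far / allowed_bad


# Example: a 99% SLO with 30 bad responses out of 1,000 in the last hour.
recent_burn = burn_rate(bad=30, total=1_000, slo=0.99)

# Example: 200 bad responses so far, with 100,000 requests expected in the window.
remaining = budget_remaining(bad_so_far=200, expected_total=100_000, slo=0.99)
```

In this example the service is burning budget at roughly three times the sustainable rate, while about 80% of the window's budget is still unspent. An alert would typically fire on the burn rate rather than wait for the budget to run out.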
04

AI-Specific Quality & Performance Clauses

AI SLAs require clauses addressing the probabilistic nature of models:

  • Hallucination & Factual Accuracy: Defines acceptable rates of unsupported output and methods for detection.
  • Data Drift & Model Decay: Specifies monitoring for data drift detection and obligations for model retraining.
  • Output Consistency: May include targets for variance in responses to identical inputs.
  • Context Window & Memory Limits: Defines limits on input token length and agentic context management.
  • Tool Calling Reliability: For agentic systems, specifies success rates for external API execution.
05

Monitoring, Reporting & Exclusions

This section defines how compliance is measured and reported, and what is excluded from SLA calculations.

  • Measurement Methodology: How SLIs are collected (e.g., client-side telemetry, server logs) and aggregated (e.g., percentile latency p95, p99).
  • Reporting Frequency: Regular SLA compliance reports (e.g., monthly).
  • Exclusions (Force Majeure): Standard exclusions for outages beyond provider control, plus AI-specific exclusions such as:
    • Attacks such as adversarial inputs or prompt injection.
    • Use outside defined Critical User Journeys (CUJs).
    • Use of unsupported input data formats or volumes.
06

Business & Legal Terms

The contractual framework that operationalizes the technical metrics.

  • Service Credits: Financial penalties or credits applied if SLOs are not met, often tiered based on error budget consumption.
  • Termination Rights: Conditions under which the customer can terminate the agreement for chronic SLA failures.
  • Data Ownership & Security: Asserts customer ownership of input data and output, and defines security standards.
  • Governance & Audit: Rights for customers to audit SLA measurements and the provider's AI governance practices.
  • Liability Caps: Limits on total liability, which is crucial given the potential impact of AI errors.
SERVICE LEVEL HIERARCHY

SLA vs. SLO vs. SLI: Core Definitions

A comparison of the three core components of service level management, defining their distinct roles, legal status, and typical content.

| Feature | Service Level Agreement (SLA) | Service Level Objective (SLO) | Service Level Indicator (SLI) |
| --- | --- | --- | --- |
| Core Definition | A formal contract defining the minimum expected service level, including penalties for non-compliance. | An internal, quantitative target for service reliability or performance, derived from SLIs. | A directly measurable metric quantifying a specific aspect of service performance. |
| Primary Audience | External customers or business stakeholders. | Internal engineering and product teams (e.g., SREs). | Internal engineering and operations teams. |
| Legal & Business Nature | Legally binding contract with financial penalties or service credits. | Internal goal or target; not a customer-facing guarantee. | Raw measurement; a technical instrument. |
| Typical Content | Formal terms, SLOs, remedies, penalties, scope, exclusions. | A specific percentage target (e.g., 99.9%) over a time window (e.g., 30 days). | A precise metric definition and measurement method (e.g., proportion of requests served in under 300ms). |
| Relationship | Contains one or more SLOs as its measurable commitments. | Defined using one or more SLIs as its basis for measurement. | The foundational measurement used to evaluate an SLO. |
| Example for AI Service | "Model inference API will have 99.5% availability monthly, or service credit applies." | "Target: 99.5% of requests have latency < 500ms over a rolling 28-day window." | "Metric: Latency measured from request receipt to final token streamed, calculated as p95." |
| Change Process | Requires formal negotiation and contract amendment. | Can be adjusted internally based on error budget and product needs. | Refined as monitoring improves; technical implementation detail. |
| Failure Consequence | Contractual breach leading to financial penalties or service credits. | Consumes error budget, informing release and operational decisions. | Triggers operational investigation; data point for SLO evaluation. |

SERVICE LEVEL AGREEMENT

Examples of AI-Specific SLA Clauses

Traditional SLAs for uptime and latency are insufficient for AI services. These clauses define measurable, AI-specific quality guarantees, often tied to financial penalties or service credits.

01

Inference Latency & Throughput

Defines the maximum allowable time for a model to return a prediction and the minimum number of requests it must handle per second.

  • Time to First Token (TTFT): Latency from request start to first output token for LLMs.
  • Time Per Output Token (TPOT): Latency for each subsequent token in a stream.
  • Throughput (QPS): Queries per second, often with continuous batching to optimize GPU use.
  • Tail Latency (p95, p99): Guarantees for the slowest requests, which most impact user experience.

Example: "p95 end-to-end inference latency, computed over each 5-minute window, shall not exceed 250ms in at least 99.9% of windows."
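
One way to evaluate a windowed tail-latency clause like this is to bucket latency samples into fixed 5-minute windows, compute p95 per window, and report the share of windows within the limit. The sketch below assumes tumbling (non-overlapping) windows and a nearest-rank p95. A real contract would need to specify both, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def window_compliance(samples, window_s=300, limit_ms=250.0):
    """Fraction of windows whose p95 latency is within limit_ms.

    samples: iterable of (timestamp_s, latency_ms) pairs.
    Windows with no traffic are ignored, which is itself a policy choice.
    """
    buckets = defaultdict(list)
    for timestamp_s, latency_ms in samples:
        buckets[int(timestamp_s // window_s)].append(latency_ms)
    compliant = sum(1 for latencies in buckets.values() if p95(latencies) <= limit_ms)
    return compliant / len(buckets)
```

Comparing the returned fraction against the contractual target (for instance 0.999) then gives a pass or fail for the reporting period.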

02

Model Quality & Accuracy

Guarantees the statistical performance of the model against a held-out evaluation dataset or live traffic.

  • Accuracy/Precision/Recall/F1 Score: Minimum thresholds for classification tasks.
  • Hallucination Rate: Maximum permissible percentage of factually incorrect or unsupported outputs for generative models.
  • Answer Faithfulness/Attribution: For RAG systems, the degree to which answers are grounded in provided source documents.
  • Instruction Following Accuracy: Measures adherence to complex prompt constraints.

Example: "The model shall maintain a hallucination rate below 2% as measured by automated fact-checking against the provided context over a monthly period."
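
Because hallucination rate is usually estimated from a sample of responses, a measured rate just under the threshold may not be strong evidence of compliance. The following sketch, which assumes each audited response has already been labeled as hallucinated or not by whatever fact-checking method the SLA names, reports the point estimate alongside a Wilson score upper bound. Using a confidence bound at all is an assumption of this example, not something the clause above requires.

```python
import math


def hallucination_rate(flags):
    """Share of audited responses flagged as hallucinated (flags is a list of bools)."""
    return sum(flags) / len(flags)


def wilson_upper_bound(bad, n, z=1.96):
    """Upper end of the Wilson score interval for a proportion (z=1.96 is about 95%)."""
    p = bad / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + margin) / denom


# Example: 15 flagged responses in an audit sample of 1,000.
flags = [True] * 15 + [False] * 985
rate = hallucination_rate(flags)
upper = wilson_upper_bound(bad=15, n=1_000)
```

Here the point estimate is 1.5%, below a 2% threshold, but the upper bound is close to 2.5%. Whether the contract is judged on the point estimate or on a bound is worth settling in the measurement methodology.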

03

Retrieval System Performance (RAG)

Specific to Retrieval-Augmented Generation architectures, these clauses govern the quality of the document retrieval step.

  • Retrieval Precision/Recall@K: Precision@K is the proportion of the top K results that are relevant; Recall@K is the proportion of all relevant documents that appear in the top K.
  • Retrieval Latency: Time to fetch and rank context documents.
  • Context Relevance Score: Semantic match between query and retrieved chunks.
  • Source Attribution Completeness: Requirement to cite specific document passages.

Example: "The retrieval system shall achieve a mean Precision@5 of ≥85%, as evaluated on a monthly human audit sample of queries."
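
For reference, Precision@K and Recall@K can be computed as in the sketch below. The document IDs and the shape of the audit sample (a list of retrieved-list and relevant-set pairs) are assumptions for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_precision_at_k(audit_sample, k):
    """Average Precision@K over (retrieved_list, relevant_set) pairs from an audit."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in audit_sample]
    return sum(scores) / len(scores)


# Example: one audited query with three relevant hits in the top five.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d8"}
p5 = precision_at_k(retrieved, relevant, k=5)   # 3 of 5 results are relevant
r5 = recall_at_k(retrieved, relevant, k=5)      # 3 of 4 relevant documents found
```

Note that Precision@5 for a single query can only take the values 0%, 20%, 40%, 60%, 80%, or 100%, so a threshold such as 85% is only meaningful as an average over many queries.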

04

Agentic Task Success

For autonomous AI agents, this defines the reliability of multi-step task completion.

  • End-to-End Task Success Rate: Percentage of user-initiated tasks completed without human intervention.
  • Tool Calling Success Rate: Reliability of external API executions.
  • Reasoning Trace Validity: Logical correctness of the agent's step-by-step reasoning.
  • Mean Time To Recovery (MTTR): For agent failures, the average time to auto-recover or escalate.

Example: "The customer service agent shall successfully resolve 92% of tier-1 support tickets within the defined workflow without escalation over a quarterly period."

05

Data & Model Governance

Clauses ensuring data privacy, model stability, and compliance with regulations.

  • Data Drift/Concept Drift Detection: Commitment to monitor and alert on significant input distribution shifts.
  • Model Version Rollback SLA: Maximum time to revert to a previous stable model version.
  • Privacy & Security: Adherence to differential privacy guarantees or data encryption standards.
  • Bias & Fairness Audits: Schedule and methodology for evaluating model performance across protected classes.

Example: "Provider will perform weekly statistical tests for data drift and notify Customer within 2 hours if drift exceeds a PSI threshold of 0.2."
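
The Population Stability Index (PSI) named in this clause compares a binned baseline distribution with a binned current one. A minimal sketch follows, assuming the bin counts have already been computed with identical bin edges. The small `eps` floor for empty bins is a common workaround rather than part of the definition.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_counts: per-bin counts from the baseline (e.g. training) data.
    actual_counts: per-bin counts from current traffic, using the same bins.
    """
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each share at eps so an empty bin does not produce log(0).
        e_share = max(expected / expected_total, eps)
        a_share = max(actual / actual_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Example: a feature that was split 50/50 at training time is now split 80/20.
drift = psi([50, 50], [80, 20])
needs_notification = drift > 0.2  # threshold taken from the example clause
```

In this example the index comes out at roughly 0.42, well above the 0.2 threshold, so the notification obligation would be triggered.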

06

Availability & Business Impact

Ties AI service reliability to business outcomes and defines remedies for failure.

  • Error Budget Consumption (Burn Rate): Defines alerting thresholds based on the rate of SLO violation.
  • Composite SLO: Overall reliability score derived from multiple AI-specific SLIs.
  • Business Metric Correlation: SLA may be linked to downstream metrics like conversion rate or customer satisfaction (CSAT).
  • Service Credits/Penalties: Financial remedies defined per violation, often scaling with severity and duration.

Example: "If the composite SLO for the recommendation service falls below 99.5% for a calendar month, Customer will receive a service credit equal to 15% of that month's fees."
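
A tiered service credit schedule can be expressed as a small lookup. In the sketch below, only the 99.5% threshold and 15% credit come from the example clause. The deeper 99.0% tier and its 30% credit are invented purely to illustrate tiering.

```python
# (threshold, credit_fraction) pairs: the credit applies when attainment is below the threshold.
CREDIT_TIERS = [
    (0.990, 0.30),  # hypothetical deeper tier, not taken from the text
    (0.995, 0.15),  # matches the example clause above
]


def service_credit(composite_slo: float, monthly_fee: float, tiers=CREDIT_TIERS) -> float:
    """Credit owed for the month, using the most severe tier that applies."""
    # Sorting ascending by threshold checks the most severe tier first.
    for threshold, credit_fraction in sorted(tiers):
        if composite_slo < threshold:
            return monthly_fee * credit_fraction
    return 0.0


# Example: monthly attainment of 99.3% on a 10,000 monthly fee falls in the 15% tier.
credit = service_credit(composite_slo=0.993, monthly_fee=10_000)
```

Keeping the schedule as data rather than branching logic makes it easier to audit against the contract text when terms are renegotiated.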

Frequently Asked Questions

Essential questions about Service Level Agreements (SLAs) and their critical role in defining, measuring, and enforcing performance and quality standards for AI-powered services.

A Service Level Agreement (SLA) is a formal contract between a service provider and a customer that defines the minimum acceptable level of service for an AI-powered system, including financial penalties or service credits if the specified Service Level Objectives (SLOs) are not met.

For AI services, an SLA goes beyond traditional infrastructure metrics to encompass model-specific quality indicators. This includes targets for:

  • Model inference latency and throughput.
  • Hallucination rate or answer faithfulness for generative systems.
  • Retrieval precision for RAG architectures.
  • Agent task success rate for autonomous systems.

The SLA operationalizes the error budget derived from SLOs, defining the concrete business and operational consequences of missing reliability or quality targets. It is the ultimate accountability mechanism, ensuring AI service performance is contractually binding.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.