A health check is a periodic probe or request sent to a service instance to verify its operational status and readiness to receive traffic. In containerized environments like Kubernetes, these are implemented as liveness probes (to determine if a container needs restarting) and readiness probes (to determine if a container can accept requests). For AI services, a health check endpoint typically validates that the model server, its dependencies, and any required hardware accelerators are functioning correctly before the service is included in a load balancer's pool.
Health Check

What is a Health Check?
A health check is a fundamental mechanism for verifying the operational status of a service, forming the basis for Service Level Indicators (SLIs) and Objectives (SLOs) in AI-powered systems.
Effective health checks feed a critical Service Level Indicator (SLI) for availability, directly informing Service Level Objectives (SLOs). They provide the foundational signal for site reliability engineering (SRE) practices, enabling automated remediation and supporting graceful degradation. A failed health check can trigger alerts based on error budget burn rate or initiate a canary deployment rollback, helping to keep user-facing Critical User Journeys (CUJs) from being impacted by an unhealthy backend component.
Core Characteristics of a Health Check
In AI systems, health checks are critical for maintaining the reliability defined by Service Level Objectives (SLOs).
Liveness vs. Readiness Probes
Health checks are typically implemented as two distinct probe types in containerized environments like Kubernetes.
- Liveness Probe: Determines if the service process is running. A failure triggers a container restart.
- Readiness Probe: Determines if the service is ready to accept traffic (e.g., model loaded, dependencies connected). A failure removes the pod from the load balancer. For AI services, a readiness probe might check that the model weights are loaded into GPU memory and that the vector database connection is active.
Endpoint Design & Response Codes
A health check is implemented as a dedicated, lightweight HTTP endpoint (e.g., /health).
- A 200 OK response with a simple JSON payload (`{"status": "healthy"}`) indicates operational health.
- Any 4xx or 5xx status code signals failure.

For AI services, the endpoint logic should verify critical subsystems: the model inference engine, tokenizer, and any retrieval or memory backends (e.g., vector database). The check must be low-latency to avoid becoming a performance bottleneck itself.
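The endpoint behaviour described above can be sketched with Python's standard library alone. This is a minimal illustration, not the API of any particular model server: the `CHECKS` registry, the check names, and the choice of 503 for failures are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple

# Hypothetical subsystem checks; each returns True when the subsystem is usable.
CHECKS: Dict[str, Callable[[], bool]] = {
    "inference_engine": lambda: True,
    "tokenizer": lambda: True,
    "vector_db": lambda: True,
}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return (HTTP status code, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed check.
            results[name] = False
    healthy = all(results.values())
    status_code = 200 if healthy else 503  # 503 is an assumed failure code
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return status_code, body


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health; any other path returns 404."""

    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status_code, body = evaluate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status_code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, the handler can be passed to `http.server.HTTPServer(("0.0.0.0", 8000), HealthHandler)` and started with `serve_forever()`. Keeping `evaluate_health` separate from the handler makes the check logic testable without opening a socket.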
Integration with SLOs & Error Budgets
Health check failures directly impact Service Level Indicators (SLIs) like error rate and availability, which are measured against Service Level Objectives (SLOs).
- A failing readiness probe that causes a pod to be taken out of service may protect the overall error budget by preventing bad requests.
- Persistent liveness probe failures that cause frequent restarts consume the error budget through increased downtime.

Health check status is a foundational signal for calculating SLO burn rate and triggering alerts before user impact becomes severe.
AI-Specific Health Indicators
Beyond process status, health checks for AI services must validate model-specific states:
- Model Load Status: Confirms the serialized model is loaded and accessible in memory/GPU.
- Context Window Availability: For LLMs, checks that the request queue or continuous batching system is not saturated.
- Tool/API Connectivity: For agentic systems, verifies connections to external tools and APIs, such as those exposed through Model Context Protocol (MCP) servers.
- Retrieval System Latency: Probes the vector database or knowledge graph to ensure retrieval SLIs (e.g., p95 latency) are within acceptable bounds for RAG systems.
Frequency, Timeouts, and Failure Thresholds
Health check behavior is governed by three key parameters:
- periodSeconds: How often to perform the probe (e.g., every 10 seconds).
- timeoutSeconds: Time to wait for a response before considering it a failure (typically kept shorter than the period).
- failureThreshold: The number of consecutive failures required to declare the container unhealthy.

These settings must be tuned for AI workloads. A readiness check that runs a tiny inference to validate the pipeline may need a longer timeout when model inference latency is high.
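To make the interaction of these three parameters concrete, here is a small sketch of the consecutive-failure rule and a rough detection-time estimate. The `ProbeConfig` class and both helper functions are hypothetical names, and the estimate assumes probes fire exactly once per period and that each failing probe takes the full timeout.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProbeConfig:
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3


def first_unhealthy_probe(results: Iterable[bool], config: ProbeConfig) -> Optional[int]:
    """Return the 1-based index of the probe that trips the threshold, or None."""
    consecutive = 0
    for index, ok in enumerate(results, start=1):
        # Any success resets the streak; only consecutive failures count.
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= config.failure_threshold:
            return index
    return None


def worst_case_detection_seconds(config: ProbeConfig) -> int:
    """Rough upper bound on the time from a fault to the unhealthy verdict.

    Assumes the fault starts just after a successful probe, so up to one full
    period passes before the first failing probe, and the last failing probe
    takes the whole timeout to report.
    """
    return config.failure_threshold * config.period_seconds + config.timeout_seconds
```

With a 10-second period, a 1-second timeout, and a threshold of 3, the estimate is about 31 seconds, which is worth comparing against how quickly the SLO's error budget would burn during that window.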
Synthetic Transactions & Critical User Journeys
Advanced health checks simulate real user behavior via synthetic transactions. For an AI service, this could be:
- Sending a canonical query to a RAG system and validating the response contains citations.
- Executing a simple tool-calling sequence with an autonomous agent.
- Measuring Time to First Token (TTFT) and Time Per Output Token (TPOT) for a streaming LLM endpoint.

These checks validate the entire Critical User Journey (CUJ) and provide a stronger signal than a simple endpoint ping, closely aligning health with user-centric SLOs.
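As an illustration of the TTFT and TPOT measurement, the sketch below times any iterable of tokens. The `measure_stream` function and its injectable `clock` parameter are assumptions made for this example; a real probe would wrap the streaming client of whichever serving stack is in use.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_seconds, tpot_seconds, token_count)."""
    start = clock()
    first_token_at = None
    last_token_at = start
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    # TPOT averages the gaps between tokens after the first one.
    tpot = (last_token_at - first_token_at) / (count - 1) if count > 1 else 0.0
    return ttft, tpot, count
```

Passing the clock in as a parameter keeps the function deterministic under test; in production the default `time.perf_counter` is used.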
How Health Checks Work in AI Systems
In AI system architecture, a health check is a lightweight, automated diagnostic request sent to a service endpoint—such as a model inference API—to verify its operational status and readiness to handle production traffic. These checks, often implemented as liveness and readiness probes in container orchestration platforms like Kubernetes, form a foundational Service Level Indicator (SLI) for system availability. A failed health check typically triggers an alert or initiates automatic remediation, such as restarting a pod or rerouting traffic, to maintain Service Level Objective (SLO) compliance for uptime.
For AI services, health checks extend beyond simple HTTP status codes to validate critical dependencies. A comprehensive probe may test connectivity to vector databases, check for GPU memory saturation, or verify that the loaded machine learning model can execute a trivial inference. This ensures the entire serving stack, not just the web server, is functional. Integrating these checks with SLO monitoring and error budget tracking allows engineering teams to preemptively address degradation before it impacts Critical User Journeys (CUJs), such as a user query to a Retrieval-Augmented Generation (RAG) system.
Health Check Examples for AI Services
For AI systems, health checks must validate both infrastructure and model-specific functionality.
Liveness Probe for Model Servers
A liveness probe determines if a container or service process is running. For an AI model server (e.g., a TensorFlow Serving or vLLM instance), this is a simple HTTP GET request to a designated /health/live endpoint.
- Purpose: Tells the orchestrator (Kubernetes) whether the process is still responsive; if the probe fails, the container is restarted.
- Implementation: Often a lightweight check that the server's HTTP stack is responsive, without invoking the model.
- Example: `curl -f http://model-service:8000/health/live` returns HTTP 200 if the server process is alive.
Readiness Probe for Warm Models
A readiness probe assesses if a service is ready to accept user traffic. For AI services, this must confirm the model is loaded into memory and the inference engine is initialized.
- Purpose: Prevents traffic from being routed to a pod that is booting up or a model that is still loading.
- Critical Check: Verifies GPU availability and that the model's computational graph is cached. A failed readiness probe tells the load balancer to stop sending requests.
- Implementation: A call to a `/health/ready` endpoint that performs a trivial inference (e.g., on a zero tensor) to validate the full pipeline.
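One possible shape for the readiness logic is a small gate that only reports ready once the model has loaded and a trivial inference has succeeded. The `ReadinessGate` class, the `load_model` callable, and the list-of-floats stand-in for a zero tensor are all hypothetical; a real server would call into its own framework here.

```python
from typing import Callable, Optional, Sequence

# Assumed model interface: a callable mapping an input vector to an output vector.
Model = Callable[[Sequence[float]], Sequence[float]]


class ReadinessGate:
    """Reports ready only after a successful load and warm-up inference."""

    def __init__(self, load_model: Callable[[], Model], input_size: int) -> None:
        self._load_model = load_model
        self._input_size = input_size
        self._model: Optional[Model] = None
        self._ready = False

    def warm_up(self) -> bool:
        """Load the model and run one inference on an all-zero input."""
        self._ready = False
        try:
            self._model = self._load_model()
            output = self._model([0.0] * self._input_size)
            self._ready = len(output) > 0
        except Exception:
            # Loading or inference failed; stay not-ready.
            self._model = None
        return self._ready

    def http_status(self) -> int:
        """Status code a /health/ready handler could return."""
        return 200 if self._ready else 503
```

Until `warm_up()` succeeds, `http_status()` returns 503, so an orchestrator polling the readiness endpoint would keep the instance out of rotation while the model is still loading.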
Dependency Health for RAG Systems
Complex AI services like Retrieval-Augmented Generation (RAG) have critical downstream dependencies. A comprehensive health check must validate each component.
- Typical Dependencies: Vector database (e.g., Pinecone, Weaviate), embedding model endpoint, and the core LLM provider.
- Implementation Pattern: The health check endpoint performs a lightweight query to each dependency:
  - A `ping` to the vector database cluster.
  - A simple embedding generation for a test string.
  - A tokenization call to the LLM API.
- Failure Action: If any dependency is unhealthy, the service marks itself as not ready, preventing partial failures for user requests.
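The dependency pattern above might be sketched as follows, running each check in a thread so that one slow dependency cannot stall the others. The function names and the dependency labels are illustrative assumptions; real checks would call the actual vector database, embedding, and LLM clients, each with its own client-side timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def check_dependencies(
    checks: Dict[str, Callable[[], bool]],
    timeout_seconds: float = 1.0,
) -> Dict[str, bool]:
    """Run all dependency checks concurrently and return a name -> healthy map."""
    results: Dict[str, bool] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        for name, future in futures.items():
            try:
                # The timeout applies per dependency, waited on in turn.
                results[name] = bool(future.result(timeout=timeout_seconds))
            except Exception:
                # Timeouts and exceptions both count as an unhealthy dependency.
                results[name] = False
    finally:
        # Do not block on checks that are still running past their timeout.
        pool.shutdown(wait=False)
    return results


def is_ready(results: Dict[str, bool]) -> bool:
    """The service is ready only if every dependency is healthy."""
    return all(results.values())
```

A timed-out check keeps running in its worker thread until it returns, so this sketch bounds the health endpoint's response time but does not cancel the underlying call.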
Model-Specific Quality Gate
Beyond basic uptime, health checks can validate model performance and output quality to catch silent degradations before users are impacted.
- Example Checks:
  - Output Schema Validation: Generate a response to a canned prompt and validate the JSON structure against a predefined schema.
  - Numerical Stability: Run a fixed inference with a fixed random seed and assert the output logits or embeddings are within an expected numerical range.
  - Hallucination Baseline: For a RAG system, query with a known fact and assert the answer contains a required keyword or entity.
- Frequency: These deeper checks run less frequently (e.g., every 30 seconds) than simple liveness probes (every 2 seconds) due to higher computational cost.
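A quality gate combining the schema and keyword checks could look like the sketch below. The response shape (an `answer` string and a `citations` list), the `generate` callable, and the function names are assumptions for illustration only.

```python
import json
from typing import Callable, Dict

# Assumed response schema: field name -> required Python type.
REQUIRED_FIELDS: Dict[str, type] = {"answer": str, "citations": list}


def validate_schema(raw_response: str, required: Dict[str, type] = REQUIRED_FIELDS) -> bool:
    """Return True if the response is a JSON object with the required typed fields."""
    try:
        data = json.loads(raw_response)
    except (ValueError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), expected) for key, expected in required.items())


def answer_contains(raw_response: str, keyword: str) -> bool:
    """Case-insensitive keyword check on the answer field of a valid response."""
    answer = json.loads(raw_response)["answer"]
    return keyword.lower() in answer.lower()


def quality_gate(generate: Callable[[str], str], prompt: str, keyword: str) -> bool:
    """Run the canned prompt and apply the schema check, then the keyword check."""
    raw = generate(prompt)
    return validate_schema(raw) and answer_contains(raw, keyword)
```

The keyword check only runs after the schema check passes, so `answer_contains` can assume a well-formed response.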
Latency & Throughput Sentinel
A health check can monitor inference performance against internal Service Level Indicators (SLIs) to detect infrastructure degradation.
- How it works: The probe executes a standardized, representative inference request and measures the latency.
- Alerting: If the p95 latency across recent probe runs exceeds a threshold (e.g., 500ms for a simple task), the health check fails. This can indicate:
  - GPU thermal throttling.
  - Memory contention from other processes.
  - Network latency spikes to a remote model host.
- Throughput Check: Can also verify the service accepts a small burst of concurrent requests without queueing errors.
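Because a single probe yields one latency sample, a p95 threshold needs a window of recent samples. The sketch below keeps a fixed-size rolling window and uses the nearest-rank percentile; the `LatencySentinel` class, the window size of 20, and the 500 ms default are illustrative assumptions.

```python
from collections import deque
from typing import Deque


class LatencySentinel:
    """Fails the health check when the rolling p95 probe latency is too high."""

    def __init__(self, threshold_ms: float = 500.0, window: int = 20) -> None:
        self.threshold_ms = threshold_ms
        # Old samples fall off automatically once the window is full.
        self.samples: Deque[float] = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95_ms(self) -> float:
        if not self.samples:
            raise ValueError("no latency samples recorded")
        ordered = sorted(self.samples)
        # Nearest-rank percentile: ceil(0.95 * n), computed with integers.
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def healthy(self) -> bool:
        # With no samples yet there is nothing to judge, so report healthy.
        return not self.samples or self.p95_ms() <= self.threshold_ms
```

With a window of 20, a single slow probe does not fail the check, but two slow probes in the window push the p95 over the threshold.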
Circuit Breaker Integration
Health check results are integrated with client-side circuit breakers (e.g., using libraries like Resilience4j or Polly) to prevent cascading failures.
- Mechanism: The client library tracks the failure rate of recent requests to an instance. If failures exceed a threshold, the circuit opens and all subsequent requests fail fast for a cooldown period.
- Health Check Role: After the cooldown, a single health check request is sent as a half-open test. If it succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit re-opens.
- Benefit: This pattern protects the overall system by isolating unhealthy model instances, allowing them time to recover or be replaced by the orchestrator.
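The closed, open, and half-open states can be sketched as a small state machine. This is a simplified illustration, not the API of Resilience4j or Polly: it counts consecutive failures rather than a failure rate, and the class and method names are hypothetical.

```python
import time
from typing import Callable


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self.state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._trial_in_flight = False

    def allow_request(self) -> bool:
        """Return True if a request (or the half-open trial) may be sent."""
        if self.state == self.OPEN:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                return False  # still cooling down: fail fast
            self.state = self.HALF_OPEN
            self._trial_in_flight = False
        if self.state == self.HALF_OPEN:
            if self._trial_in_flight:
                return False  # only one trial request at a time
            self._trial_in_flight = True
        return True

    def record_success(self) -> None:
        self._failures = 0
        self._trial_in_flight = False
        self.state = self.CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        self._trial_in_flight = False
        # A failed trial, or too many consecutive failures, opens the circuit.
        if self.state == self.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = self.OPEN
            self._opened_at = self._clock()
```

In this sketch the caller sends the health check request when `allow_request()` returns True in the half-open state, then reports the outcome through `record_success()` or `record_failure()`.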
Health Check vs. Related Monitoring Concepts
A comparison of the health check, an instance-level liveness or readiness probe, with other key monitoring constructs used to define and enforce service reliability for AI systems.
| Feature / Purpose | Health Check (Liveness/Readiness Probe) | Service Level Indicator (SLI) | Service Level Objective (SLO) | Golden Signal |
|---|---|---|---|---|
| Primary Function | Binary verification of instance operational status (up/down, ready/not-ready). | Continuous measurement of a specific performance attribute (e.g., latency, error rate). | A target reliability goal defined as a threshold on an SLI over time. | A high-level, user-centric metric for holistic service health. |
| Measurement Granularity | Per service instance or pod. | Aggregated across the service (e.g., all requests). | Aggregated across the service over a compliance period. | Aggregated across the service. |
| Output/Result | Boolean (pass/fail). Typically triggers orchestration actions (restart, drain). | Raw time-series data (e.g., latency histogram, error count). | Boolean (SLO met/violated) over an evaluation window. Defines an error budget. | A numeric value or status used for dashboarding and high-level alerting. |
| Typical Implementation | Lightweight HTTP/HTTPS/TCP endpoint or command executed by the orchestrator (K8s). | Instrumentation in application code or service mesh (metrics from Prometheus, Datadog). | Calculated from SLI data using a tool like Google Cloud SLO, Nobl9, or custom pipelines. | Derived from core infrastructure and application metrics (often the four signals: latency, traffic, errors, saturation). |
| Direct Action Trigger | Yes. Immediate, automated instance-level remediation (restart, reschedule). | No. Provides data for alerting and SLO calculation, but not direct remediation. | Yes. Triggers organizational and process actions (freeze on deploys, error budget discussions). | Yes. Triggers human investigation and broad operational response. |
| AI/ML Specificity | Generic. Ensures the model server/agent container is running and reachable. | Highly specific. Can be model inference latency (p99), token throughput, hallucination rate, or retrieval precision. | Highly specific. Defines acceptable bounds for AI quality (e.g., <1% hallucination rate, p95 latency <500ms). | Generic. Applied to AI services (e.g., error rate for inference endpoints, saturation of GPU memory). |
| Relation to Error Budget | Indirect. Failures contribute to service-level error rates, which consume the budget. | Direct. The measured value (e.g., error rate) is the input for calculating budget consumption. | Direct. Defines the total error budget (100% - SLO%). Burn rate is calculated against it. | Indirect. Golden signal anomalies may indicate conditions leading to SLO burn. |
| Example | HTTP GET /health returns 200 OK. Container is scheduled to receive traffic. | SLI: Proportion of LLM inference requests with latency < 1 second. | SLO: 99.9% of LLM inference requests have latency < 1 second over a 30-day window. | Signal: Saturation. GPU memory utilization > 85% for 5 minutes. |
Frequently Asked Questions
Questions and answers about implementing health checks for AI-powered services, a foundational practice for establishing reliable Service Level Objectives (SLOs) and Indicators (SLIs).
A health check is a periodic probe or request sent to a service instance to verify its operational status and readiness to receive traffic, often implemented as liveness and readiness probes in containerized environments. For AI services, this extends beyond basic HTTP status codes to validate core dependencies like model servers, vector databases, and GPU availability. A comprehensive health check ensures the entire inference pipeline—from input validation to token generation—is functional before the service is marked healthy and included in a load balancer's pool. This is the first line of defense for meeting Service Level Objectives (SLOs) related to availability and error rate.
Related Terms
A health check is a foundational operational probe, but managing AI services requires a comprehensive framework of quantitative objectives and indicators. These related terms define the specific metrics, targets, and strategies for ensuring AI service reliability and performance.
Service Level Objective (SLO)
A Service Level Objective (SLO) is a quantitative target for the reliability, performance, or quality of a service, expressed as a percentage of requests that must meet a specific Service Level Indicator (SLI) over a defined time window. For AI services, SLOs are critical for defining acceptable performance bounds.
- Example: "99.9% of inference requests must have a latency under 100ms over a 30-day window."
- SLOs are internal goals, distinct from external Service Level Agreements (SLAs) which carry contractual penalties.
- Setting appropriate SLOs involves balancing user experience with engineering feasibility and cost.
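The example SLO above can be evaluated from raw latency samples in a few lines. This sketch uses hypothetical function names and treats a request as "good" when its latency is strictly under the threshold; the time window is assumed to be handled by whatever selects the samples.

```python
from typing import Sequence


def good_fraction(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: the fraction of requests faster than the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the evaluation window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def slo_met(latencies_ms: Sequence[float], threshold_ms: float, target: float) -> bool:
    """SLO: the SLI must be at or above the target (e.g., 0.999)."""
    return good_fraction(latencies_ms, threshold_ms) >= target


# 999 fast requests and 1 slow one just meet a 99.9% target at 100 ms.
window = [50.0] * 999 + [150.0]
print(slo_met(window, threshold_ms=100.0, target=0.999))
```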
Service Level Indicator (SLI)
A Service Level Indicator (SLI) is a directly measurable metric that quantifies a specific aspect of a service's performance. It is the raw measurement that an SLO is based upon. For AI systems, SLIs must be carefully chosen to reflect user-perceived quality.
- Common AI SLIs: Model inference latency, error rate (e.g., 5xx HTTP errors), throughput (requests per second), and task-specific quality metrics like answer faithfulness or retrieval precision.
- An SLI is always defined with a measurement method and aggregation window (e.g., average latency over 1 minute).
- Effective SLIs are user-centric, measuring outcomes that directly impact the end-user experience.
Error Budget
An error budget is the allowable amount of service unreliability, calculated as 100% - SLO. It defines the risk a team can accept for deploying new features or making changes without violating the SLO.
- If an SLO is 99.9% reliability, the error budget is 0.1% unreliability over the compliance period.
- The budget can be spent on planned risk (like deployments) or consumed by unplanned incidents.
- Burn rate monitors how quickly the error budget is being consumed, triggering alerts if exhaustion is imminent. This shifts focus from "is something broken?" to "are we at risk of breaking our promises?"
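The arithmetic behind the error budget and burn rate is short enough to show directly. The function names below are hypothetical; burn rate is computed here as the observed error rate divided by the budget, so a value of 1.0 means the budget would be used up exactly at the end of the compliance period.

```python
def error_budget(slo: float) -> float:
    """Allowed unreliability as a fraction, e.g. 0.001 for a 99.9% SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being consumed."""
    return observed_error_rate / error_budget(slo)


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime over the window."""
    return error_budget(slo) * window_days * 24 * 60
```

For a 99.9% SLO over 30 days, the budget works out to roughly 43 minutes of full downtime, and a sustained 1% error rate corresponds to a burn rate of about 10.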
Golden Signal
A golden signal is one of four fundamental metrics used in Site Reliability Engineering (SRE) to comprehensively monitor the health and performance of any service: Latency, Traffic, Errors, and Saturation.
- Latency: The time it takes to service a request (e.g., p95 inference time).
- Traffic: A measure of demand (e.g., queries per second, token generation rate).
- Errors: The rate of failed requests (e.g., non-2xx HTTP statuses, model hallucination rate).
- Saturation: How "full" the service is (e.g., GPU utilization, queue depth).

These signals provide a holistic view beyond simple uptime and are the primary sources for defining SLIs.
Canary Deployment
A canary deployment is a release strategy where a new version of a service (e.g., an updated ML model) is deployed to a small, representative subset of users or traffic. Its performance is closely monitored against SLIs before a full rollout.
- This technique is essential for validating SLO compliance of new changes in a low-risk, production-like environment.
- Production Canary Analysis involves comparing the canary's SLI performance (latency, error rate) against a stable baseline.
- If the canary violates error budget policies, the deployment is automatically rolled back, preventing a widespread SLO breach.
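A canary comparison of this kind might be sketched as below. The `SliSnapshot` type, the `should_rollback` function, and every threshold (error-rate delta, latency ratio, minimum sample size) are illustrative assumptions rather than values from any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliSnapshot:
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_rollback(
    baseline: SliSnapshot,
    canary: SliSnapshot,
    max_error_rate_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
    min_requests: int = 100,
) -> bool:
    """Return True if the canary is clearly worse than the stable baseline."""
    if canary.requests < min_requests:
        return False  # too little traffic to judge; keep observing
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return True
    if baseline.p95_latency_ms > 0:
        if canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio:
            return True
    return False
```

The minimum-requests guard avoids rolling back on noise from a handful of requests; a production analysis would typically add a statistical test on top of fixed thresholds.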
Tail Latency (p95, p99)
Tail latency, measured by percentiles like p95 or p99, is the latency threshold that the slowest 5% or 1% of requests exceed. It is critical for AI services because user perception is often defined by worst-case performance.
- A p99 latency of 500ms means 99% of requests are faster than 500ms, and 1% are slower.
- Tail Latency Amplification in distributed AI systems can cause p99 to be orders of magnitude worse than p50 due to queuing, garbage collection, or dependency chains.
- SLOs for user-facing AI features must explicitly target tail latency percentiles, not just averages, to ensure a consistently good experience.
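A short worked example shows why averages hide the tail. The sketch uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values) and made-up latency data; the `percentile` function is a hypothetical helper.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling
    return ordered[rank - 1]


# Made-up data: 97 requests at 100 ms and 3 stragglers at 2000 ms.
latencies_ms = [100.0] * 97 + [2000.0] * 3
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(mean_ms)                       # 157.0
print(percentile(latencies_ms, 50))  # 100.0
print(percentile(latencies_ms, 95))  # 100.0
print(percentile(latencies_ms, 99))  # 2000.0
```

The mean and even the p95 look healthy here, while the p99 exposes the 2-second stragglers that a small share of users would actually experience.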

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.