Readiness Probe

A Kubernetes mechanism that determines if a container is ready to serve requests; if the probe fails, the container's pod is removed from service endpoints.
KUBERNETES DEPLOYMENT

What is a Readiness Probe?

A readiness probe is a Kubernetes health check that determines if a container is ready to serve network requests.

If a readiness probe fails, Kubernetes removes the container's pod from the endpoints of every matching Service, so no traffic is sent to an unprepared instance. A liveness probe, by contrast, decides whether a container needs to be restarted. Probes can be configured as HTTP requests, TCP socket checks, or commands executed inside the container.

Readiness probes are critical for managing application startup dependencies, such as database connections or cache warming, and for implementing rolling updates and canary deployments without downtime. They ensure traffic is only directed to pods that are fully operational, maintaining the stability of the overall service. Proper configuration is essential for achieving high availability and graceful handling of partial failures in a microservices architecture.

KUBERNETES MECHANISM

Key Characteristics of Readiness Probes

Readiness probes are a core Kubernetes health check mechanism that determines if a container is ready to serve network traffic. Unlike liveness probes, they do not restart containers but manage traffic routing.

01

Purpose: Traffic Routing Gatekeeper

A readiness probe's primary function is to act as a traffic gatekeeper. When a probe succeeds, the container's pod is added as an endpoint to the associated Kubernetes Service. If it fails, the pod is removed from service endpoints, preventing traffic from being sent to an unprepared or malfunctioning container. This is distinct from a liveness probe, which determines if a container should be restarted.

02

Probe Types & Configuration

Kubernetes supports three primary probe mechanisms, defined in a container's spec (recent releases also offer a gRPC probe):

  • HTTP GET Probe: Sends an HTTP request to a specified path and port; succeeds on a 2xx or 3xx status code.
  • TCP Socket Probe: Attempts to open a TCP connection to a specified port; succeeds if the connection is established.
  • Exec Probe: Executes a specified command inside the container; succeeds if the command exits with status code 0.

Key configuration parameters include initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold.
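
To make the three mechanisms concrete, here is a minimal sketch of a pod whose containers each use a different readiness check. The pod name, container names, images, ports, and paths are placeholders chosen for illustration, not values from a real system.

```yaml
# Illustrative pod: three containers, each with a different readiness mechanism.
apiVersion: v1
kind: Pod
metadata:
  name: probe-types-demo
spec:
  containers:
    - name: web
      image: example.com/web:1.0
      readinessProbe:
        httpGet:              # ready when the response status is 2xx or 3xx
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
    - name: cache
      image: example.com/cache:1.0
      readinessProbe:
        tcpSocket:            # ready when a TCP connection can be opened
          port: 6379
        periodSeconds: 10
        timeoutSeconds: 1
    - name: worker
      image: example.com/worker:1.0
      readinessProbe:
        exec:                 # ready when the command exits with status 0
          command: ["cat", "/tmp/ready"]
        periodSeconds: 5
        failureThreshold: 3
```

A pod counts as ready only when all of its containers are ready, so in this sketch a single failing check keeps the whole pod out of Service endpoints.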

03

Integration with Deployment Strategies

Readiness probes are essential for safe rolling updates and canary deployments. During a rolling update, new pods are created but only receive traffic once their readiness probe passes. If a new pod never becomes ready, the rollout stalls instead of replacing further old pods (within the limits set by the Deployment's maxSurge and maxUnavailable), which keeps a bad release from cascading across the service. This allows for progressive delivery by ensuring only healthy pods serve user requests, enabling zero-downtime deployments.
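
The interaction between readiness and a rolling update can be sketched with a Deployment like the one below. The name, labels, image, port, and path are hypothetical, and the strategy values are one conservative choice rather than a recommendation.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  minReadySeconds: 10            # a new pod must stay ready this long before it counts as available
  progressDeadlineSeconds: 600   # after 10 minutes without progress the rollout is reported as stalled
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # at most one extra pod above the desired count
      maxUnavailable: 0          # never drop below the desired number of available pods
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:2.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
```

With maxUnavailable set to 0, an old pod is removed only after a new pod has become available, so a release whose readiness probe never passes leaves the old pods serving traffic. Exceeding progressDeadlineSeconds only marks the rollout as failed to progress; it does not roll back on its own.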

04

Critical for Stateful Services

Probes are crucial for stateful applications like databases, caches, and LLM inference servers that require initialization. For example, a vector database pod may need to load a large index into memory before it can serve queries. A readiness probe that passes only after this warm-up keeps the Service from routing traffic to the pod too early, and a sufficient initialDelaySeconds avoids probing during the known startup window, so end users do not see timeouts and errors.
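
One common way to gate on warm-up is a marker file that the application writes once loading has finished. The fragment below is a sketch under that assumption: the image, the marker path /tmp/index-ready, and the behaviour of the entrypoint are all hypothetical, and the image is assumed to ship a `test` binary.

```yaml
# Container spec fragment for a hypothetical vector database.
containers:
  - name: vector-db
    image: example.com/vector-db:1.0
    # Assumed behaviour: the process creates /tmp/index-ready only after
    # the index has been fully loaded into memory.
    readinessProbe:
      exec:
        command: ["test", "-f", "/tmp/index-ready"]
      initialDelaySeconds: 60   # skip probing during the part of startup known to be slow
      periodSeconds: 10
      failureThreshold: 3
```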

05

Common Pitfalls & Anti-Patterns

Misconfiguration leads to operational issues:

  • Shallow Checks: Probing a trivial endpoint (e.g., /) that doesn't verify critical dependencies (e.g., model loaded, database connection established).
  • Insufficient Initial Delay: Causing premature failure for services with long startup times.
  • Overlapping with Liveness: Using the same check for both readiness and liveness can cause unnecessary restarts under load.
  • Resource-Intensive Checks: An exec probe running a heavy script can exacerbate resource pressure on a struggling pod.
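
A sketch of how the overlap pitfall can be avoided: the liveness probe stays cheap and tolerant, while the readiness probe is deeper and reacts quickly. The paths /healthz and /ready, the port, and the thresholds are illustrative assumptions.

```yaml
containers:
  - name: api
    image: example.com/api:1.0
    ports:
      - containerPort: 8080
    livenessProbe:              # cheap check: is the process responsive at all?
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 6       # tolerate about a minute of slowness before restarting
    readinessProbe:             # deeper check: can this instance serve requests right now?
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 2       # stop routing traffic quickly
```

Under load, a configuration like this takes the pod out of rotation well before the liveness probe would restart it.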

06

Related Concepts in Traffic Management

Readiness probes operate within a broader ecosystem of traffic and deployment controls:

  • Liveness Probe: Determines if a container should be restarted (survival).
  • Health Check: A generic term for any endpoint or check used to assess application status.
  • Load Balancer: Relies on readiness probe status to update its pool of healthy backends.
  • Service Mesh: Often provides more advanced health checking and traffic shifting capabilities on top of basic Kubernetes probes.

How a Readiness Probe Works

A readiness probe is a Kubernetes health check that determines if a container is ready to serve network requests, ensuring traffic is only routed to fully initialized and healthy pods.

A readiness probe is a Kubernetes mechanism that periodically tests a container within a pod to verify it is ready to accept traffic. If the probe succeeds, the pod's IP address is added to the list of valid endpoints for the associated Service. If it fails, the pod is temporarily removed from the Service's endpoint list, preventing the kube-proxy and ingress controllers from routing requests to it. This ensures users are not directed to a pod that is still starting up, crashing, or overloaded.
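
In clusters that use EndpointSlices, this state is visible on the endpoint objects themselves. The excerpt below is illustrative only: the object name, Service name, and addresses are made up. It shows that an unready pod is typically kept in the slice but flagged as not ready, which has the same effect on routing as removing it.

```yaml
# Illustrative EndpointSlice excerpt for a Service named "api".
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: api-abc12
  labels:
    kubernetes.io/service-name: api
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 8080
endpoints:
  - addresses: ["10.0.1.12"]
    conditions:
      ready: true        # readiness probe passing: eligible for traffic
  - addresses: ["10.0.2.7"]
    conditions:
      ready: false       # readiness probe failing: not used for routing
```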

Probes are configured in the pod specification and can execute one of three actions: an HTTP GET request, a TCP socket check, or a command executed inside the container. Readiness probes underpin zero-downtime deployments such as rolling updates and canary deployments, because new pods receive traffic only after they report ready. Draining connections from terminating pods is handled separately: Kubernetes removes a pod from Service endpoints once it is marked for deletion, regardless of probe status. Proper configuration helps prevent cascading failures and is a foundational practice for high availability in microservices architectures.

KUBERNETES HEALTH CHECKS

Readiness Probe vs. Liveness Probe

A comparison of the two primary health check mechanisms used by Kubernetes to manage container lifecycle and traffic routing.

| Feature | Readiness Probe | Liveness Probe |
| --- | --- | --- |
| Primary Purpose | Determines if a container is ready to serve requests. | Determines if a running container is still healthy or needs to be restarted. |
| Probe Failure Action | Removes the pod's IP address from all Service endpoints. | The kubelet kills the container, which is then restarted according to the pod's restart policy. |
| Impact on Traffic | Stops sending new traffic to the pod; existing connections may continue. | In-flight requests to the container are dropped when it is killed and restarted. |
| Typical Use Case | Waiting for dependencies (database, cache) to be ready, or warming up large models. | Detecting a deadlock or hung process where the app is running but unresponsive. |
| Common Check Types | HTTP GET, TCP Socket, Exec command | HTTP GET, TCP Socket, Exec command |
| Default Configuration | No default; must be explicitly defined. | No default; must be explicitly defined. |
| Configuration Parameters | initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, failureThreshold | initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, failureThreshold |
| Effect on Deployment Rollouts | A failing probe will prevent a new pod from receiving traffic during a rolling update, potentially stalling the rollout. | A failing probe will cause container restarts, which can help recover a stuck deployment but may cause churn if misconfigured. |


Readiness Probes in LLM Operations

A readiness probe is a Kubernetes health check that determines if a containerized application, such as an LLM inference service, is fully initialized and ready to accept network traffic. If the probe fails, the container's pod is temporarily removed from service endpoint lists, preventing requests from being sent to an unprepared instance.

01

Core Mechanism & Purpose

A readiness probe is a periodic check executed by the kubelet (the node agent) against a container. Its primary purpose is to signal when a pod is ready to start serving traffic. Unlike a liveness probe (which determines if a container should be restarted), a failed readiness probe does not restart the container; it simply removes the pod's IP address from the endpoints of all matching Services. This prevents load balancers from routing user requests to pods that are still booting, loading a large model, or warming up caches, ensuring users only hit healthy, responsive endpoints.

  • Probe Types: HTTP GET, TCP Socket, or Exec command.
  • Critical for LLMs: Large models can take minutes to load into GPU memory; a readiness probe gates traffic until this is complete.

02

Configuration Parameters

Readiness probes are defined in a pod's container specification with several key parameters that control their behavior:

  • initialDelaySeconds: The number of seconds to wait after the container starts before initiating probes. Crucial for LLMs to allow for model loading.
  • periodSeconds: How often (in seconds) to perform the probe.
  • timeoutSeconds: Number of seconds after which the probe times out.
  • successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed.
  • failureThreshold: Number of consecutive failures required for the probe to be considered failed.

Example Configuration for an LLM API:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 180  # Allow 3 minutes for model load
  periodSeconds: 10
  failureThreshold: 3
```
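
For reference, the same probe can be written with all five parameters spelled out. The values for periodSeconds, timeoutSeconds, successThreshold, and failureThreshold below are the Kubernetes defaults; the timing comment is a rough estimate, not an exact guarantee.

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 180  # first probe runs about 3 minutes after the container starts
  periodSeconds: 10         # probe every 10 s (default)
  timeoutSeconds: 1         # each probe must answer within 1 s (default)
  successThreshold: 1       # one success marks the container ready again (default)
  failureThreshold: 3       # three consecutive failures mark it unready (default)
  # Rough timing: a container that stops answering is marked unready after
  # about failureThreshold x periodSeconds = 30 s.
```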

03

Integration with Traffic Management

The readiness probe is a foundational component for advanced traffic and deployment strategies. Its status directly interfaces with Kubernetes Services and Ingress controllers.

  • Service Endpoints: The Kubernetes control plane continuously tracks pods that match a Service's selector and publishes them as endpoints. Only pods with passing readiness probes are marked ready to receive traffic.
  • Rolling Updates & Canary Deployments: During a rolling update, new pods are created. They remain isolated from live traffic until their readiness probes pass, ensuring a smooth transition. For canary deployments, readiness probes validate the new version's health before any traffic is shifted via traffic splitting.
  • Load Balancers: Cloud load balancers (e.g., AWS ALB, GCP Cloud Load Balancing) that integrate with Kubernetes use endpoint readiness to determine where to send requests, preventing 503 errors and connection drops.
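
The Service side of this relationship needs no probe-specific settings. In the sketch below, the Service name, the label app: llm-api, and port 80 are assumptions for illustration; targetPort 8000 matches the probe port used in the example configuration above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-api
spec:
  type: ClusterIP
  selector:
    app: llm-api          # pods carrying this label are candidates for endpoints
  ports:
    - name: http
      port: 80            # port exposed by the Service
      targetPort: 8000    # container port, the same one the readiness probe checks
```

Of the pods matching the selector, only those currently reporting ready receive traffic sent to the Service.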

04

Designing Effective Probes for LLMs

A well-designed readiness check for an LLM service must verify more than just a running process. It should assert the application is functionally ready.

Best Practices:

  • Endpoint Logic: The /health/ready endpoint should check critical dependencies: model loaded in memory, vector database connection established, GPU availability, and any warm-up caches (e.g., KV caches for inference optimization).
  • Lightweight & Fast: The check must be computationally cheap and return quickly (e.g., < 1 second) to avoid becoming a bottleneck.
  • Distinct from Liveness: Use a separate /health/live endpoint for liveness that checks if the process is running, while readiness performs deeper checks.
  • Graceful Shutdown: When a pod is marked for deletion, Kubernetes removes it from Service endpoints regardless of probe status, but that removal takes time to propagate. The readiness endpoint should therefore also start failing as soon as the service begins draining connections.
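
A sketch of the graceful shutdown practice, written as a pod spec fragment. The image name, the 10-second sleep, and the 60-second grace period are illustrative assumptions, and the image is assumed to contain a shell.

```yaml
spec:
  terminationGracePeriodSeconds: 60   # total time allowed before the container is force-killed
  containers:
    - name: llm-api
      image: example.com/llm-api:1.0
      lifecycle:
        preStop:
          exec:
            # Give endpoint removal a few seconds to propagate before SIGTERM arrives.
            command: ["sh", "-c", "sleep 10"]
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8000
        periodSeconds: 5
        failureThreshold: 1           # report unready after a single failed check
```

The preStop delay covers the window in which load balancers and proxies may still hold the pod's address, and the remaining grace period gives in-flight requests time to finish.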

05

Common Failure Scenarios & Debugging

Understanding why a readiness probe fails is key to maintaining high availability.

Typical Failure Causes:

  • Insufficient initialDelaySeconds: The most common issue. The probe starts before the LLM finishes loading its weights.
  • Resource Contention: The pod may be CPU throttled or waiting for a GPU, causing health check timeouts.
  • External Dependency Failure: The probe checks a downstream service (e.g., a tokenizer service, feature store) that is unavailable.
  • Buggy Probe Logic: The health check itself has an error or infinite loop.

Debugging Commands:

  • kubectl describe pod <pod-name>: View probe status, last error, and events.
  • kubectl logs <pod-name>: Check application logs for errors during startup.
  • kubectl exec -it <pod-name> -- curl localhost:8000/health/ready: Manually test the probe endpoint from within the pod.
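
The pod's status also shows readiness directly. The excerpt below illustrates what `kubectl get pod <pod-name> -o yaml` may contain for a pod whose readiness probe is not passing; the container name and values are examples, and real output includes many more fields.

```yaml
# Illustrative excerpt of a pod's status section.
status:
  phase: Running
  conditions:
    - type: ContainersReady
      status: "False"
      reason: ContainersNotReady
    - type: Ready
      status: "False"
      reason: ContainersNotReady
  containerStatuses:
    - name: llm-api
      ready: false          # the readiness probe has not passed, or is currently failing
      started: true
      restartCount: 0       # unchanged: readiness failures do not restart the container
```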

06

Related Operational Concepts

Readiness probes do not operate in isolation; they are part of a broader ecosystem of cloud-native resilience patterns.

  • Liveness Probe: Determines if a container needs to be restarted. Used for catching deadlocks or stalled processes where the app is running but not functional.
  • Startup Probe: Used for applications with long or unpredictable startup times. Liveness and readiness checks are disabled until it succeeds once.
  • Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas based on metrics. New pods created by the HPA must pass their readiness probes before receiving traffic.
  • Service Mesh (e.g., Istio): Often adds its own layer of health checking and outlier detection, which can work in concert with or independently of Kubernetes readiness probes.
  • Circuit Breaker: A pattern implemented at the service mesh or application level to stop sending requests to a failing pod, complementing the probe's traffic removal function.
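
As a sketch of how a startup probe combines with the other two, the fragment below gives a slow-loading service a generous startup budget while keeping steady-state checks tight. The image name and thresholds are assumptions; the paths and port follow the examples used earlier in this article.

```yaml
containers:
  - name: llm-api
    image: example.com/llm-api:1.0
    startupProbe:               # runs first; the other probes wait until it succeeds once
      httpGet:
        path: /health/ready
        port: 8000
      periodSeconds: 10
      failureThreshold: 60      # allows up to 60 x 10 s = 10 minutes for model loading
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8000
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8000
      periodSeconds: 10
      failureThreshold: 3
```

If the startup probe exhausts its failureThreshold, the kubelet kills the container and restarts it according to the pod's restart policy, so the budget should comfortably exceed the slowest expected startup.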

Frequently Asked Questions

A Readiness Probe is a critical Kubernetes health check that determines if a container is ready to accept network traffic. This FAQ addresses its core function, configuration, and role in ensuring high availability for LLM-powered applications.

A Readiness Probe is a Kubernetes mechanism that determines if a container within a pod is ready to serve requests. If the probe succeeds, the pod's IP address is added to the endpoints of the matching Service, making it eligible to receive traffic. If the probe fails, the pod is removed from the Service's endpoints, preventing traffic from being routed to an unhealthy or initializing container.

This is distinct from a Liveness Probe, which determines if a running container is still healthy and restarts it if not. The readiness probe manages traffic flow, while the liveness probe manages container lifecycle.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.