A health probe is a diagnostic mechanism, such as a liveness or readiness check, used by an orchestrator (e.g., Kubernetes) to determine the operational status of a service or container. It performs periodic requests to a defined endpoint, evaluating the response against success criteria to decide if an instance is healthy and capable of receiving traffic. This enables automatic failure detection and triggers recovery actions like restarting or draining unhealthy pods, forming the foundation for resilient, self-healing software systems.

What is a Health Probe?
A health probe is a diagnostic check used by an orchestrator to determine the operational status of a service or container.
In practice, a liveness probe determines if a container needs to be restarted, while a readiness probe controls its inclusion in a service's load balancer. These probes integrate with broader fault-tolerant patterns like circuit breakers and graceful degradation. For autonomous agents, analogous agentic health checks assess logical soundness and operational readiness, ensuring the system can maintain service level objectives (SLOs) by preemptively isolating failures before they cascade.
Key Characteristics of Health Probes
Health probes are the foundational diagnostic mechanism for autonomous systems, enabling orchestrators to make deterministic decisions about service availability and lifecycle management without human intervention.
Probe Types: Liveness vs. Readiness
Health probes are categorized by their operational purpose. A liveness probe determines whether a container is still functioning; a failure typically triggers a restart. A readiness probe determines whether a container is ready to accept traffic (e.g., dependencies initialized, caches warmed); a failure stops traffic from being routed to the pod without restarting it. A third type, the startup probe, is used for applications with long initialization times (often legacy ones): liveness and readiness checks are held off until it succeeds.
Probe Mechanisms & Execution
In Kubernetes, probes are executed by the kubelet, the agent running on each node, against a container according to a defined schedule. The three primary mechanisms are:
- HTTP GET: The most common. The kubelet sends an HTTP request to a specified path and port. A success code (200-399) passes the probe.
- TCP Socket: The kubelet attempts to open a TCP connection to a specified port. Success is established if a connection is made.
- Exec Command: The kubelet executes a specified command inside the container. A zero exit code indicates success.
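The three mechanisms above can be approximated in a few lines of standard-library Python. This is a minimal sketch of the checks themselves, not the kubelet's implementation; the function names and default timeouts are assumptions made for illustration.

```python
import socket
import subprocess
import urllib.error
import urllib.request


def http_get_probe(url: str, timeout: float = 1.0) -> bool:
    """Pass if the endpoint answers with a status code in the 200-399 range."""
    try:
        # urlopen follows redirects, so the final response is what gets evaluated.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # 4xx/5xx responses raise HTTPError (a URLError subclass); refused
        # connections and timeouts surface as URLError or OSError.
        return False


def tcp_socket_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Pass if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exec_probe(command: list[str], timeout: float = 1.0) -> bool:
    """Pass if the command exits with status 0 before the timeout."""
    try:
        completed = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return completed.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a command that cannot be started counts as a failure.
        return False
```

Each function returns a plain boolean, which keeps the pass/fail decision separate from whatever action the caller takes on it.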
Configuration Parameters for Resilience
Probe behavior is finely tuned via parameters to balance responsiveness with stability, preventing flapping (rapid, cyclical failures). Key parameters include:
- initialDelaySeconds: Wait time after container start before initiating probes.
- periodSeconds: How often to perform the probe.
- timeoutSeconds: Number of seconds after which the probe times out.
- successThreshold: Minimum consecutive successes for the probe to be considered successful after a failure.
- failureThreshold: Number of consecutive failures required for the probe to be considered failed.
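To make the interaction between successThreshold and failureThreshold concrete, the sketch below tracks consecutive probe results and flips the health verdict only when a threshold is crossed. The class name, the snake_case field names, and the assumption that an instance starts out healthy are illustrative choices, not part of any orchestrator's API.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Applies consecutive-result thresholds to a stream of probe outcomes."""

    success_threshold: int = 1
    failure_threshold: int = 3
    healthy: bool = True  # assumed starting state for this sketch
    _successes: int = 0
    _failures: int = 0

    def record(self, passed: bool) -> bool:
        """Record one probe result and return the current health verdict."""
        if passed:
            self._successes += 1
            self._failures = 0  # a success breaks any failure streak
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0  # a failure breaks any success streak
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


tracker = ProbeTracker(success_threshold=2, failure_threshold=3)
outcomes = [False, False, True, False, False, False, True, True]
verdicts = [tracker.record(outcome) for outcome in outcomes]
# Two failures followed by a success do not flip the verdict; three
# consecutive failures do, and two consecutive successes restore it.
print(verdicts)
```

Because an isolated failure resets nothing but its own streak, a single slow response does not mark the instance unhealthy, which is how the thresholds dampen flapping.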
Integration with Orchestrator Lifecycle
Probes are integral to the container orchestrator's control loops. In Kubernetes, probe results directly inform the decisions of core controllers:
- The kubelet uses liveness probes to decide when to restart a container.
- The kubelet uses readiness probes to set the pod's Ready condition; the control plane then adds or removes the pod's IP from the endpoints list of any matching Service.
- The Deployment controller considers pod readiness during rolling updates, ensuring new pods are ready before scaling down old ones. This creates a deterministic, self-healing feedback loop.
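The three decisions listed above can be reduced to simple functions over probe verdicts. The sketch below is a deliberately simplified model: `PodStatus`, the function names, and the `min_ready` parameter are hypothetical, and real controllers weigh additional settings that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class PodStatus:
    name: str
    ip: str
    live: bool   # latest liveness verdict
    ready: bool  # latest readiness verdict


def containers_to_restart(pods: list[PodStatus]) -> list[str]:
    """Liveness decision: names of pods whose container should be restarted."""
    return [pod.name for pod in pods if not pod.live]


def service_endpoints(pods: list[PodStatus]) -> list[str]:
    """Readiness decision: only ready pods keep their IP in the endpoints list."""
    return [pod.ip for pod in pods if pod.ready]


def can_scale_down_old(new_pods: list[PodStatus], min_ready: int) -> bool:
    """Rolling-update gate: retire old pods only once enough new ones are ready."""
    return sum(pod.ready for pod in new_pods) >= min_ready


pods = [
    PodStatus("web-1", "10.0.0.1", live=True, ready=True),
    PodStatus("web-2", "10.0.0.2", live=True, ready=False),   # still warming up
    PodStatus("web-3", "10.0.0.3", live=False, ready=False),  # hung process
]
print(containers_to_restart(pods))  # only web-3 failed its liveness check
print(service_endpoints(pods))      # only web-1 receives traffic
```

Note that web-2 is neither restarted nor sent traffic: failing readiness alone takes a pod out of rotation without touching the container.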
Designing Effective Probe Endpoints
A well-designed probe endpoint is lightweight, stateless, and checks critical internal dependencies. Best practices include:
- Checking internal in-memory state or a local cache.
- Performing a shallow check on a crucial downstream dependency (e.g., database connection pool).
- Avoiding deep dependency checks that cascade failures or heavy computational logic that consumes significant resources. The endpoint should return quickly to avoid blocking the orchestrator's control loop.
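As an illustration of these practices, the following sketch serves two lightweight endpoints that only read in-memory flags. The `/ready` path, the `APP_STATE` dictionary, and the `db_pool_ready` flag are assumptions for the example; a real application would update such flags from its own startup and connection-pool code.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory flags that the application itself would maintain.
APP_STATE = {"started": True, "db_pool_ready": False}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/health":
            # Liveness: answer from local state only, no downstream calls.
            ok = APP_STATE["started"]
        elif self.path == "/ready":
            # Readiness: shallow check of one crucial dependency's cached status.
            ok = APP_STATE["started"] and APP_STATE["db_pool_ready"]
        else:
            self.send_error(404)
            return
        body = json.dumps({"ok": ok}).encode()
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep periodic probe traffic out of the logs in this sketch


def start_health_server(port: int = 0) -> ThreadingHTTPServer:
    """Serve the health endpoints on a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The handler never performs I/O against a dependency while answering, so its response time stays flat even when the database is slow; the cost is that the flags must be kept current elsewhere in the application.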
Relation to Circuit Breakers & Observability
Health probes operate at the infrastructure layer, while patterns like the Circuit Breaker operate at the application layer. A circuit breaker trips based on the failure rate of calls the application makes to a dependency, while a readiness probe fails when the instance itself does not pass a technical health check. Together, they provide layered fault tolerance. Probe metrics (success/failure counts, latency) are critical observability signals, feeding into dashboards and alerts to provide a real-time view of system resilience and the effectiveness of self-healing mechanisms.
How Health Probes Work
A health probe is a diagnostic check used by an orchestrator to determine the operational status of a service or container, enabling autonomous failure detection and recovery.
Health probes function as the primary feedback mechanism for self-healing software systems, allowing platforms like Kubernetes to automatically restart, terminate, or route traffic away from unhealthy instances. This creates a closed loop in which the platform's actual state is continuously reconciled with a declared desired state.
Probes execute by periodically making a request (an HTTP call, a TCP socket connection, or a command execution) to a predefined endpoint within the application. Based on the response (success, failure, or timeout), the orchestrator takes corrective action. For example, a failed liveness probe triggers a container restart, while a failed readiness probe removes the pod from service load balancers, enabling graceful degradation and preventing cascading failures.
Liveness vs. Readiness Probes: A Comparison
A detailed comparison of the two primary health probe types used by container orchestrators like Kubernetes to manage container lifecycle and traffic routing.
| Probe Feature | Liveness Probe | Readiness Probe |
|---|---|---|
| Primary Purpose | Determine if the container process is alive and running. A failure triggers a container restart. | Determine if the container is ready to accept network traffic (e.g., HTTP requests). A failure removes the pod from service endpoints. |
| Failure Action | The kubelet kills the container and restarts it according to the pod's restart policy. | Traffic stops being routed to the pod: its IP address is removed from the endpoints of all matching Services. |
| Typical Check Logic | A simple check that the main process is responsive (e.g., a basic TCP connection, HTTP request to a non-critical endpoint). | A check that all dependencies are initialized and ready (e.g., database connections are live, cache is warmed, large files are loaded). |
| Probe Timing | Starts after the configured initial delay. | Starts after the configured initial delay. |
| Impact on System State | Disruptive. A restart resets in-memory state and terminates existing connections. | Non-disruptive. No container restart; existing in-flight requests may complete if the pod is not terminated. |
| Common Implementation | HTTP GET request to a dedicated health endpoint. | HTTP GET request to a dedicated readiness endpoint. |
| Design Principle | Follows the "Let-it-Crash" philosophy. If unhealthy, restart to reach a clean state. | Enables graceful degradation and load shedding. Protects the service from traffic it cannot handle. |
Where Health Probes Are Used
Health probes are a fundamental mechanism for building resilient, self-healing systems. They are implemented across the entire software stack, from container orchestration to application logic.
Frequently Asked Questions
A health probe is a diagnostic mechanism used by orchestrators like Kubernetes to assess the operational status of a service instance. This glossary addresses common technical questions about their implementation, purpose, and role in self-healing architectures.
A health probe is a diagnostic check, such as a liveness or readiness probe, used by an orchestrator to determine the operational status of a service or container. It works by periodically sending a request—typically an HTTP GET, a TCP socket connection, or an executed command—to a predefined endpoint within the application. The orchestrator evaluates the response (or timeout) against configured success criteria to decide if the instance is healthy and capable of receiving traffic, or if it requires restarting or removal from the service pool.
Key Mechanism:
- Orchestrator Initiated: The platform control plane (e.g., the kubelet in Kubernetes) executes the probe.
- Defined Endpoint: The application must expose a specific path (e.g., /health) or port for the check.
- Configurable Parameters: Critical settings include initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold.
- Binary Decision: Based on the probe result, the orchestrator takes a deterministic action: keep the pod in service, restart it, or mark it as not ready.
Related Terms
A health probe is a fundamental diagnostic mechanism within resilient architectures. These related concepts define the broader ecosystem of patterns and protocols that enable autonomous detection, isolation, and recovery from failures.
Circuit Breaker Pattern
A software design pattern that prevents an application from repeatedly attempting an operation likely to fail. It acts as a proxy for operations, monitoring for failures. When failures exceed a threshold, the circuit opens, failing fast and preventing cascading failures and resource exhaustion. After a timeout, it enters a half-open state to test if the underlying problem has resolved before closing again. This pattern is critical for building fault-tolerant microservices and is a complementary control mechanism to health probes.
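The closed, open, and half-open states described above can be sketched as a small wrapper around a callable. This is one simple way to realize the pattern under stated assumptions: the class and parameter names are hypothetical, failures are counted consecutively rather than as a rate, and a single successful trial call in the half-open state closes the circuit.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("failing fast; dependency presumed unhealthy")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed half-open trial, or too many failures while closed,
            # (re)opens the circuit and restarts the timeout.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure streak.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a dependency that is likely to time out, which is the fail-fast behavior the pattern exists to provide.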
Heartbeat Signal
A periodic, lightweight message sent from a subordinate component (like a service instance) to a monitoring system or orchestrator to indicate liveness. Unlike a health probe, which is an active check initiated by the orchestrator, a heartbeat is a passive, push-based signal. If heartbeats stop arriving, the monitoring system infers the component has failed. Heartbeats are often used in conjunction with probes for comprehensive liveness detection in distributed systems like Kubernetes (kubelet node status) and consensus algorithms like Raft.
Bulkhead Pattern
A fault isolation design inspired by ship compartments. It partitions system resources (e.g., thread pools, connections, memory) into isolated groups for different consumers or operations. A failure in one bulkhead (e.g., a downstream service timeout exhausting its dedicated connection pool) does not cascade and drain resources from unrelated parts of the system. This pattern ensures graceful degradation. Health probes often operate within a specific bulkhead, and their failure should only affect the associated partitioned resources.
Leader Election
A distributed coordination process where nodes in a cluster autonomously agree on a single node to act as the leader or coordinator. The leader typically manages critical tasks like assigning work or maintaining consensus. Health probes (or heartbeats) are fundamental to this process: the failure of a leader's health check triggers a new election. Algorithms like Raft and Paxos implement robust election protocols. This ensures continuous operation and is a core pattern for high-availability systems like databases (etcd, Consul) and orchestrators.
Reconciliation Loop
A fundamental control loop in declarative systems like Kubernetes. It continuously observes the actual state of the world (e.g., pod statuses from health probes) and compares it to the declared desired state (e.g., a deployment manifest). It then computes and executes the necessary actions (kill, create, restart) to converge the actual state with the desired state. Health probes provide the critical observability signal that drives this loop. This pattern is central to GitOps and self-healing infrastructure, enabling autonomous recovery from drift and failure.
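A single pass of such a loop can be sketched as a pure function from desired and observed state to a list of actions. This toy model is an assumption-laden illustration rather than how Kubernetes divides the work among its components: the function name, the action tuples, and the generated `replica-N` names are all hypothetical.

```python
def reconcile(desired_replicas: int, observed: dict[str, bool]) -> list[tuple[str, str]]:
    """Compute actions that converge observed pods toward the desired count.

    `observed` maps pod name -> latest health verdict from its probes.
    """
    actions: list[tuple[str, str]] = []

    # Remove surplus pods first, preferring unhealthy ones (False sorts before True).
    surplus = max(0, len(observed) - desired_replicas)
    by_health = sorted(observed, key=lambda name: (observed[name], name))
    doomed = by_health[:surplus]
    actions += [("delete", name) for name in doomed]

    # Restart any remaining pod whose probes report it unhealthy.
    kept = [name for name in sorted(observed) if name not in doomed]
    actions += [("restart", name) for name in kept if not observed[name]]

    # Create replacements if fewer pods exist than desired.
    missing = max(0, desired_replicas - len(observed))
    actions += [("create", f"replica-{i}") for i in range(missing)]
    return actions


print(reconcile(3, {"a": True, "b": False}))
# [('restart', 'b'), ('create', 'replica-0')]
```

Running the function repeatedly against fresh observations is what turns it into a loop: once the actual state matches the desired state, it returns an empty list and nothing further happens.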
Let-It-Crash Philosophy
A fault-tolerance philosophy central to the Erlang/OTP and Actor models. Instead of writing complex defensive code to handle every possible internal error, processes are designed to fail fast upon encountering an unexpected condition. A supervising process, equipped with a restart strategy (e.g., one-for-one, rest-for-one), detects the crash (via a monitoring mechanism analogous to a health probe) and restarts the failed process from a clean state. This creates resilient systems where failure is isolated and recovery is automated, aligning with the goals of health probes in container orchestrators.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.