Glossary

Health Probe

A health probe is a diagnostic check, such as a liveness or readiness check, used by an orchestrator to determine the operational status of a service or container.

SELF-HEALING SOFTWARE SYSTEMS

What is a Health Probe?

A health probe is a diagnostic mechanism, such as a liveness or readiness check, used by an orchestrator (e.g., Kubernetes) to determine the operational status of a service or container. It performs periodic requests to a defined endpoint, evaluating the response against success criteria to decide if an instance is healthy and capable of receiving traffic. This enables automatic failure detection and triggers recovery actions like restarting or draining unhealthy pods, forming the foundation for resilient, self-healing software systems.

In practice, a liveness probe determines if a container needs to be restarted, while a readiness probe controls its inclusion in a service's load balancer. These probes integrate with broader fault-tolerant patterns like circuit breakers and graceful degradation. For autonomous agents, analogous agentic health checks assess logical soundness and operational readiness, ensuring the system can maintain service level objectives (SLOs) by preemptively isolating failures before they cascade.

Key Characteristics of Health Probes

Health probes are the foundational diagnostic mechanism for autonomous systems, enabling orchestrators to make deterministic decisions about service availability and lifecycle management without human intervention.

Probe Types: Liveness vs. Readiness

Health probes are categorized by their operational purpose. A liveness probe determines whether a container is still functioning (for example, not deadlocked); a failure typically triggers a restart. A readiness probe determines whether a container is ready to accept traffic (e.g., dependencies initialized, caches warmed); a failure stops traffic from being sent to the pod without restarting it. A third type, the startup probe, covers applications with long initialization times (often legacy ones): liveness and readiness checks are held off until it succeeds.

Probe Mechanisms & Execution

Probes are executed by the kubelet, the Kubernetes node agent, against a container on a defined schedule. The three most widely used mechanisms are listed below; Kubernetes also supports a gRPC probe.

HTTP GET: The most common. The kubelet sends an HTTP request to a specified path and port. A success code (200-399) passes the probe.
TCP Socket: The kubelet attempts to open a TCP connection to a specified port. Success is established if a connection is made.
Exec Command: The kubelet executes a specified command inside the container. A zero exit code indicates success.
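
To make the three success criteria concrete, here is a minimal Python sketch that mimics each mechanism from the outside. It is not the kubelet's implementation; the function names and default timeouts are assumptions chosen for illustration, and only standard-library calls are used.

```python
import http.client
import socket
import subprocess
import urllib.request
from typing import List


def http_get_probe(url: str, timeout: float = 1.0) -> bool:
    """Pass when the final response status is in the 200-399 range."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # urllib raises HTTPError (an OSError subclass) for 4xx/5xx,
        # and OSError-derived errors for refused connections or timeouts.
        return False


def tcp_socket_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Pass when a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exec_probe(command: List[str], timeout: float = 1.0) -> bool:
    """Pass when the command exits with status 0 before the timeout."""
    try:
        done = subprocess.run(command, timeout=timeout, capture_output=True)
        return done.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False
```

In all three cases a timeout counts as a failure, which is why a slow endpoint can fail a probe even though it would eventually answer.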

Configuration Parameters for Resilience

Probe behavior is tuned via parameters that balance responsiveness with stability and prevent flapping (rapid oscillation between healthy and unhealthy states). Key parameters, with their Kubernetes defaults, include:

initialDelaySeconds: Wait time after container start before initiating probes (default 0).
periodSeconds: How often to perform the probe (default 10).
timeoutSeconds: Number of seconds after which the probe times out (default 1).
successThreshold: Minimum consecutive successes for the probe to be considered successful after a failure (default 1; must be 1 for liveness and startup probes).
failureThreshold: Number of consecutive failures required for the probe to be considered failed (default 3).
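
The interaction between the two thresholds is easier to see in code. The sketch below is a simplified model, not Kubernetes source: the `ProbeTracker` class and its starting state are hypothetical, and it only tracks consecutive results the way the parameter descriptions above suggest.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Tracks consecutive probe results (hypothetical helper for illustration)."""
    success_threshold: int = 1
    failure_threshold: int = 3
    healthy: bool = True  # assumed starting state
    _successes: int = 0
    _failures: int = 0

    def record(self, passed: bool) -> bool:
        """Record one probe result and return the current health verdict."""
        if passed:
            self._successes += 1
            self._failures = 0  # any success breaks a failure streak
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0  # any failure breaks a success streak
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

Because a single pass resets the failure streak, an instance that fails intermittently never crosses `failure_threshold`. Under this model, detecting a hard failure takes roughly `periodSeconds * failureThreshold` seconds.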

Integration with Orchestrator Lifecycle

Probes are integral to the container orchestrator's control loops. In Kubernetes, probe results directly inform the decisions of core components:

The kubelet uses liveness probes to decide when to restart a container.
The kubelet uses readiness probes to set the pod's Ready condition; the endpoints controller then adds or removes the pod's IP from the endpoints of any matching Service.
The Deployment controller considers pod readiness during rolling updates, ensuring new pods are ready before scaling down old ones. This creates a deterministic, self-healing feedback loop.

Designing Effective Probe Endpoints

A well-designed probe endpoint is lightweight, stateless, and checks critical internal dependencies. Best practices include:

Checking internal in-memory state or a local cache.
Performing a shallow check on a crucial downstream dependency (e.g., database connection pool).
Avoiding deep dependency checks that cascade failures or heavy computational logic that consumes significant resources. The endpoint should return quickly to avoid blocking the orchestrator's control loop.
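
As an illustration of these practices, the following sketch exposes a cheap liveness path and a readiness path backed by a shallow check. The paths `/healthz` and `/ready`, the `make_handler` factory, and the `is_ready` callable are assumptions made for the example; the check itself is whatever inexpensive test fits the application (for instance, whether a connection pool has a free connection).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_handler(is_ready: Callable[[], bool]) -> type:
    """Build a request handler whose readiness answer comes from is_ready()."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/healthz":
                # Liveness: no dependency calls, only proves the process responds.
                self._reply(200, {"status": "alive"})
            elif self.path == "/ready":
                # Readiness: one shallow check, 503 tells the prober "not yet".
                ready = is_ready()
                self._reply(200 if ready else 503, {"ready": ready})
            else:
                self._reply(404, {"error": "not found"})

        def _reply(self, code: int, payload: dict) -> None:
            body = json.dumps(payload).encode()
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args) -> None:
            pass  # keep periodic probe traffic out of the logs

    return ProbeHandler
```

A server could then be started with `HTTPServer(("0.0.0.0", 8080), make_handler(check)).serve_forever()`, where `check` is the application's own shallow readiness test and port 8080 is an arbitrary choice.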

Relation to Circuit Breakers & Observability

Health probes operate at the infrastructure layer, while patterns like the Circuit Breaker operate at the application layer. A circuit breaker trips based on the observed failure rate of calls to a dependency, while a readiness probe fails on a technical health check of the instance itself. Together, they provide layered fault tolerance. Probe metrics (success/failure counts, latency) are critical observability signals, feeding into dashboards and alerts to provide a real-time view of system resilience and the effectiveness of self-healing mechanisms.

How Health Probes Work

A health probe is a diagnostic check used by an orchestrator to determine the operational status of a service or container, enabling autonomous failure detection and recovery.

Health probes function as the primary feedback mechanism for self-healing software systems, allowing platforms like Kubernetes to automatically restart, terminate, or route traffic away from unhealthy instances. This creates a closed-loop system in which the platform's actual state is continuously reconciled with a declared desired state.

Probes execute by periodically making a request (an HTTP call, a TCP socket connection, or a command execution) against a predefined target in the container. Based on the result (success, failure, or timeout), the orchestrator takes corrective action. For example, a failed liveness probe triggers a container restart, while a failed readiness probe removes the pod from service load balancers, enabling graceful degradation and preventing cascading failures.
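
The mapping from probe outcome to corrective action can be summarized in a few lines. This is a sketch of the decision described above, assuming the failure threshold has already been crossed; the enum and function names are hypothetical.

```python
from enum import Enum


class ProbeKind(Enum):
    LIVENESS = "liveness"
    READINESS = "readiness"
    STARTUP = "startup"


class Action(Enum):
    NONE = "none"
    RESTART_CONTAINER = "restart container"
    REMOVE_FROM_ENDPOINTS = "remove pod from service endpoints"


def corrective_action(kind: ProbeKind, passed: bool) -> Action:
    """Return the action taken once a probe's verdict is known."""
    if passed:
        return Action.NONE
    if kind is ProbeKind.READINESS:
        # The container keeps running; it just stops receiving new traffic.
        return Action.REMOVE_FROM_ENDPOINTS
    # Failed liveness and startup probes both lead to a container restart.
    return Action.RESTART_CONTAINER
```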

KUBERNETES HEALTH CHECKS

Liveness vs. Readiness Probes: A Comparison

A detailed comparison of the two primary health probe types used by container orchestrators like Kubernetes to manage container lifecycle and traffic routing.

Probe Feature	Liveness Probe	Readiness Probe
Primary Purpose	Determine if the container process is alive and running. A failure triggers a container restart.	Determine if the container is ready to accept network traffic (e.g., HTTP requests). A failure removes the pod from service endpoints.
Failure Action	The kubelet kills the container and restarts it according to the pod's `restartPolicy`.	The pod is marked not ready, and its IP address is removed from the endpoints of all matching Services, so no new traffic is routed to it.
Typical Check Logic	A simple check that the main process is responsive (e.g., a basic TCP connection, HTTP request to a non-critical endpoint).	A check that all dependencies are initialized and ready (e.g., database connections are live, cache is warmed, large files are loaded).
Probe Timing	Starts after `initialDelaySeconds`. Runs continuously for the container's lifetime.	Starts after `initialDelaySeconds`. Runs continuously for the container's lifetime.
Configuration Parameters (e.g., in Kubernetes)	`initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `successThreshold`, `failureThreshold`	`initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `successThreshold`, `failureThreshold`
Impact on System State	Disruptive. A restart resets in-memory state and terminates existing connections.	Non-disruptive. No container restart; in-flight requests may complete if the pod is not terminated.
Common Implementation	HTTP GET request to a `/healthz` endpoint, TCP socket check, or Exec command (e.g., `cat /tmp/healthy`).	HTTP GET request to a `/ready` endpoint, often with deeper dependency validation than the liveness endpoint.
Design Principle	Follows the "Let-it-Crash" philosophy. If unhealthy, restart to reach a clean state.	Enables graceful degradation and load shedding. Protects the service from traffic it cannot handle.

ARCHITECTURAL PATTERNS

Where Health Probes Are Used

Health probes are a fundamental mechanism for building resilient, self-healing systems. They are implemented across the entire software stack, from container orchestration to application logic.

Container Orchestration (Kubernetes)

In Kubernetes, health probes are the primary mechanism for determining container lifecycle state. The kubelet agent on each node executes probes against the containers in its pods.

Liveness Probe: Determines if a container is running. A failed probe triggers a container restart.
Readiness Probe: Determines if a container is ready to serve traffic. A failed probe removes the pod's IP from Service endpoints.
Startup Probe: Used for applications with long initialization times (often legacy ones); liveness and readiness checks are held off until it succeeds.

Probes are defined in the Pod spec and can be HTTP GET requests, TCP socket connections, or command executions.
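
For reference, the sketch below builds a Pod manifest with all three probe types as a Python dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. The pod name, image, port, paths, and tuning values are placeholders chosen for illustration; the field names (`startupProbe`, `livenessProbe`, `readinessProbe`, `httpGet`) are the ones used in the Pod spec.

```python
import json


def http_probe(path: str, port: int, **tuning: int) -> dict:
    """Build an httpGet probe definition with optional tuning parameters."""
    return {"httpGet": {"path": path, "port": port}, **tuning}


pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-app"},  # placeholder name
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "registry.example.com/demo-app:1.0",  # placeholder image
                "ports": [{"containerPort": 8080}],
                # Allows up to 30 * 10 s for a slow start before liveness takes over.
                "startupProbe": http_probe(
                    "/healthz", 8080, periodSeconds=10, failureThreshold=30
                ),
                "livenessProbe": http_probe(
                    "/healthz", 8080, periodSeconds=10, failureThreshold=3
                ),
                "readinessProbe": http_probe(
                    "/ready", 8080, periodSeconds=5, failureThreshold=3
                ),
            }
        ]
    },
}

if __name__ == "__main__":
    print(json.dumps(pod_manifest, indent=2))
```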

Service Mesh & API Gateways

Service meshes like Istio and Linkerd, and API gateways like Kong or Envoy, use health probes for dynamic load balancing and circuit breaking.

Outlier Detection: Endpoints (pods/instances) that fail health checks or return repeated errors are ejected from the load balancing pool.
Traffic Shifting: During canary deployments, probes validate the health of new versions before shifting user traffic.
Passive Health Checking: Observes the success/failure rate of real user requests in addition to active probing.

This layer provides application-level health semantics beyond simple process liveness.
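
Passive health checking can be sketched as a counter over real responses. The example below is a simplified model rather than any mesh's actual algorithm: the `OutlierDetector` class and its default threshold are assumptions, and a real proxy would also re-admit ejected endpoints after a cool-down period.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class OutlierDetector:
    """Ejects an endpoint after N consecutive 5xx responses (illustrative only)."""

    def __init__(self, consecutive_5xx: int = 5) -> None:
        self.consecutive_5xx = consecutive_5xx
        self._streak: Dict[str, int] = defaultdict(int)
        self.ejected: Set[str] = set()

    def observe(self, endpoint: str, status_code: int) -> None:
        """Feed in the status code of one real user request."""
        if status_code >= 500:
            self._streak[endpoint] += 1
            if self._streak[endpoint] >= self.consecutive_5xx:
                self.ejected.add(endpoint)
        else:
            self._streak[endpoint] = 0  # a good response resets the streak

    def eligible(self, endpoints: Iterable[str]) -> List[str]:
        """Return the endpoints still allowed in the load balancing pool."""
        return [e for e in endpoints if e not in self.ejected]
```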

Load Balancers & Service Discovery

Cloud load balancers (AWS ELB, GCP Cloud Load Balancing, Azure Load Balancer) and service discovery tools (Consul, etcd) rely on health checks to manage backend pools.

Instance Health: Determines which virtual machines or instances are eligible to receive traffic.
Draining: Unhealthy instances are gracefully drained (finish in-flight requests) before removal.
Global Server Load Balancing (GSLB): Probes can determine the health of entire geographic regions for DNS-based failover.

These probes typically check TCP connectivity or an HTTP response on a specific port.

Database & Stateful Service Clusters

Distributed databases (PostgreSQL with Patroni, Redis Sentinel, MongoDB replica sets) and message brokers (RabbitMQ, Kafka) use health probes for leader election and failover.

Replica Lag Monitoring: Probes check if a database replica is too far behind the primary to be promoted.
Quorum Health: Consensus-based systems use probes to determine if a quorum of nodes is alive for writes.
Resource Saturation: Probes can check disk space, memory pressure, or connection pool exhaustion.

Failure triggers automated failover or read-only mode to maintain availability.
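
Two of these checks reduce to simple arithmetic, sketched below under stated assumptions: the function names, the lag threshold, and the lag measurements are hypothetical, and real failover tools apply additional rules before promoting a replica.

```python
from typing import Dict, List


def promotion_candidates(
    replica_lag_seconds: Dict[str, float], max_lag_seconds: float
) -> List[str]:
    """Return replicas close enough to the primary, least-lagged first."""
    eligible = [
        name for name, lag in replica_lag_seconds.items() if lag <= max_lag_seconds
    ]
    return sorted(eligible, key=lambda name: replica_lag_seconds[name])


def has_quorum(alive_nodes: int, cluster_size: int) -> bool:
    """A strict majority of the cluster must be alive to accept writes."""
    return alive_nodes > cluster_size // 2
```

For example, two live nodes out of three form a quorum, while two out of four do not, which is one reason consensus clusters are usually sized with an odd number of nodes.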

Application Self-Monitoring

Many applications implement internal health endpoints (/health, /ready, /live) that report on the dependencies they need in order to serve traffic.

Dependency Status: Verifies connectivity and latency to downstream databases, caches, and external APIs.
Warm-up State: Checks if in-memory caches are populated or JIT compilation is complete.
Circuit Breaker Integration: The health endpoint reflects the state of internal circuit breakers to dependencies.

This provides a holistic view of application readiness, not just process state. A common pattern is to expose separate /health/startup, /health/live, and /health/ready endpoints.
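
One way to aggregate such checks into a single readiness answer is sketched below. The `readiness_report` function and the check names are hypothetical; each check is any zero-argument callable returning a boolean, such as a cache warm-up flag or a test that a circuit breaker to a dependency is not open.

```python
from typing import Callable, Dict, Tuple


def readiness_report(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, str]]:
    """Run every check and return an HTTP status code plus per-check detail."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that raises is reported as failed, not allowed to crash
            # the health endpoint itself.
            results[name] = f"error: {type(exc).__name__}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

Returning the per-check detail in the response body helps operators see which dependency caused a 503, while the prober only needs the status code.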

Infrastructure & Platform Monitoring

Monitoring and observability platforms (Prometheus, Datadog, New Relic) use health probes as a core source of system telemetry.

Blackbox Monitoring: External probes simulate user requests from various geographic locations.
Synthetic Transactions: Probes execute critical user journeys to validate full-stack functionality.
Alerting Integration: Probe failures trigger PagerDuty alerts, Slack notifications, or automated runbooks.

This provides an external, customer-centric view of health, complementing internal orchestration probes.

HEALTH PROBE

Frequently Asked Questions

A health probe is a diagnostic mechanism used by orchestrators like Kubernetes to assess the operational status of a service instance. This glossary addresses common technical questions about their implementation, purpose, and role in self-healing architectures.

A health probe is a diagnostic check, such as a liveness or readiness probe, used by an orchestrator to determine the operational status of a service or container. It works by periodically sending a request—typically an HTTP GET, a TCP socket connection, or an executed command—to a predefined endpoint within the application. The orchestrator evaluates the response (or timeout) against configured success criteria to decide if the instance is healthy and capable of receiving traffic, or if it requires restarting or removal from the service pool.

Key Mechanism:

Orchestrator Initiated: The platform (e.g., the kubelet node agent in Kubernetes) executes the probe.
Defined Endpoint: The application must expose a specific path (e.g., /health) or port for the check.
Configurable Parameters: Critical settings include initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold.
Deterministic Action: Each probe result is a pass or a fail, and the orchestrator responds deterministically: keep the pod in service, restart it, or mark it as not ready.

Related Terms

A health probe is a fundamental diagnostic mechanism within resilient architectures. These related concepts define the broader ecosystem of patterns and protocols that enable autonomous detection, isolation, and recovery from failures.

Circuit Breaker Pattern

A software design pattern that prevents an application from repeatedly attempting an operation likely to fail. It acts as a proxy for operations, monitoring for failures. When failures exceed a threshold, the circuit opens, failing fast and preventing cascading failures and resource exhaustion. After a timeout, it enters a half-open state to test if the underlying problem has resolved before closing again. This pattern is critical for building fault-tolerant microservices and is a complementary control mechanism to health probes.
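
The closed, open, and half-open states can be modeled compactly. The following is a minimal single-threaded sketch, not a production library: the class name, thresholds, and the injectable `clock` parameter are assumptions made so the behavior is easy to follow and test.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed, allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("failing fast while the circuit is open")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```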

Heartbeat Signal

A periodic, lightweight message sent from a subordinate component (like a service instance) to a monitoring system or orchestrator to indicate liveness. Unlike a health probe, which is an active check initiated by the orchestrator, a heartbeat is a passive, push-based signal. If heartbeats stop arriving, the monitoring system infers the component has failed. Heartbeats are often used in conjunction with probes for comprehensive liveness detection in distributed systems like Kubernetes (kubelet node status) and consensus algorithms like Raft.
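
The push-based nature of heartbeats shows up clearly in code: the monitor never contacts the instance, it only notices silence. This sketch assumes a hypothetical `HeartbeatMonitor` class with an injectable clock; it is not taken from any particular system.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Infers failure from missing heartbeats (illustrative only)."""

    def __init__(
        self, timeout: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.timeout = timeout
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def beat(self, instance_id: str) -> None:
        """Called when an instance pushes a heartbeat."""
        self._last_seen[instance_id] = self._clock()

    def presumed_failed(self) -> List[str]:
        """Return instances whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            instance
            for instance, seen in self._last_seen.items()
            if now - seen > self.timeout
        )
```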

Bulkhead Pattern

A fault isolation design inspired by ship compartments. It partitions system resources (e.g., thread pools, connections, memory) into isolated groups for different consumers or operations. A failure in one bulkhead (e.g., a downstream service timeout exhausting its dedicated connection pool) does not cascade and drain resources from unrelated parts of the system. This pattern ensures graceful degradation. Health probes often operate within a specific bulkhead, and their failure should only affect the associated partitioned resources.

Leader Election

A distributed coordination process where nodes in a cluster autonomously agree on a single node to act as the leader or coordinator. The leader typically manages critical tasks like assigning work or maintaining consensus. Health probes (or heartbeats) are fundamental to this process: the failure of a leader's health check triggers a new election. Algorithms like Raft and Paxos implement robust election protocols. This ensures continuous operation and is a core pattern for high-availability systems like databases (etcd, Consul) and orchestrators.

Reconciliation Loop

A fundamental control loop in declarative systems like Kubernetes. It continuously observes the actual state of the world (e.g., pod statuses from health probes) and compares it to the declared desired state (e.g., a deployment manifest). It then computes and executes the necessary actions (kill, create, restart) to converge the actual state with the desired state. Health probes provide the critical observability signal that drives this loop. This pattern is central to GitOps and self-healing infrastructure, enabling autonomous recovery from drift and failure.
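
A single pass of such a loop can be sketched as a pure function from observed state to actions. This is a deliberately simplified model, not how Kubernetes controllers are implemented: the `reconcile` function, the action strings, and the choice to delete unhealthy pods rather than restart containers in place are all assumptions for illustration.

```python
from typing import Dict, List


def reconcile(desired_replicas: int, pods: Dict[str, bool]) -> List[str]:
    """Compute actions that move observed pods toward the desired replica count.

    `pods` maps pod name to the health verdict derived from its probes.
    """
    unhealthy = sorted(name for name, ok in pods.items() if not ok)
    healthy = sorted(name for name, ok in pods.items() if ok)

    actions = [f"delete {name}" for name in unhealthy]
    shortfall = desired_replicas - len(healthy)
    if shortfall > 0:
        actions += ["create"] * shortfall
    else:
        # Scale down any healthy pods beyond the desired count.
        actions += [f"delete {name}" for name in healthy[desired_replicas:]]
    return actions
```

Running the function repeatedly against fresh observations is what makes the loop self-healing: once actual and desired state match, it returns no actions.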

Let-It-Crash Philosophy

A fault-tolerance philosophy central to the Erlang/OTP and Actor models. Instead of writing complex defensive code to handle every possible internal error, processes are designed to fail fast upon encountering an unexpected condition. A supervising process, equipped with a restart strategy (e.g., one-for-one, rest-for-one), detects the crash (via a monitoring mechanism analogous to a health probe) and restarts the failed process from a clean state. This creates resilient systems where failure is isolated and recovery is automated, aligning with the goals of health probes in container orchestrators.
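
The supervisor idea can be approximated outside Erlang with a small retry wrapper. The sketch below is only an analogy in Python, assuming a hypothetical `supervise` function and restart limit; it does not reproduce OTP supervision trees or process isolation.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def supervise(worker: Callable[[], T], max_restarts: int = 3) -> T:
    """Run worker, restarting it from scratch after a crash, up to a limit."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            if restarts >= max_restarts:
                raise  # give up and escalate, as a supervisor eventually would
            restarts += 1  # no repair attempt: just start again from a clean state
```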

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
