Glossary

Liveness Probe

A liveness probe is a Kubernetes health check that determines if a container is still running and responsive; if it fails, the container is typically restarted.


AGENT DEPLOYMENT OBSERVABILITY

What is a Liveness Probe?

A liveness probe is a health check mechanism used by container orchestrators like Kubernetes to determine if an application instance is running and responsive.

A liveness probe is a periodic diagnostic test executed by a container orchestrator to verify that an application process is alive and not in a deadlocked or otherwise unhealthy state. If the probe fails repeatedly, the orchestrator's failure policy is triggered, typically resulting in the termination and restart of the container. This mechanism is a core component of self-healing infrastructure, ensuring that unresponsive services are automatically recovered without manual intervention.

Common probe types include HTTP GET requests to a designated endpoint, TCP socket connections, or the execution of a command inside the container. The probe's configuration defines its initial delay, period, timeout, and failure threshold. In the context of agentic observability, liveness probes are critical for monitoring the operational status of autonomous agents, ensuring they remain available to execute tasks and that any frozen or crashed processes are promptly restarted to maintain system reliability.

KUBERNETES HEALTH CHECKS

Types of Liveness Probes

Liveness probes are health checks that determine if a container is still running and responsive. If a probe fails, the container orchestrator (like Kubernetes) will typically restart it. Different probe types are suited for different application architectures.

HTTP GET Probe

The most common liveness probe type, which sends an HTTP GET request to a specified path and port on the container. The probe is considered successful if the server returns a response with an HTTP status code between 200 and 399.

Use Case: Ideal for web servers, REST APIs, and any service with an HTTP endpoint dedicated to health.
Configuration: Requires specifying path, port, and optionally httpHeaders.
Example: A probe checking http://localhost:8080/healthz. A 200 OK means the app is alive; a 404 or 500 counts as a failed probe, and the container is restarted once failureThreshold consecutive probes have failed.
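To make the application side of an HTTP GET probe concrete, here is a minimal sketch of a health endpoint using only the Python standard library. The /healthz path and port 8080 come from the example above; the HealthHandler and make_server names are hypothetical, and a real service would usually expose this route from its existing web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200; every other path gets 404."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Probes arrive every few seconds; keep them out of the logs.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) the health server on the given port."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling make_server().serve_forever() starts the server. Any status code from 200 to 399 would satisfy the probe; the sketch simply returns 200.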

TCP Socket Probe

A probe that attempts to open a TCP connection to a specified port on the container. Success is based solely on whether a TCP handshake can be established; no data is sent or received.

Use Case: Best for non-HTTP services like databases (PostgreSQL, Redis), custom TCP protocols, or gRPC services (where a simple port check suffices).
Configuration: Requires only the port number.
Key Consideration: It only confirms the port is listening, not that the service is functionally healthy. A deeper health check may require a different probe type.
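The check a TCP socket probe performs can be approximated in a few lines. This is an illustrative sketch, not the kubelet's implementation; the tcp_probe name and its default timeout are assumptions made for the example.

```python
import socket


def tcp_probe(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """Return True if a TCP connection can be opened; no data is exchanged."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Because the function returns True as soon as the handshake completes, it shows the limitation noted above: a process that accepts connections but cannot serve requests still passes.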

Exec Probe

A probe that executes a specified command inside the container. The probe is successful if the command exits with a status code of 0; any other exit code indicates failure.

Use Case: For complex health checks that require custom logic, such as checking the state of a file, verifying internal process status, or running a diagnostic script.
Configuration: Defined by a command array in the pod spec.
Example: A command like ["pg_isready", "-U", "postgres"] to check PostgreSQL readiness. Caution: Exec probes consume more resources than HTTP/TCP probes and the command must be present in the container's filesystem.
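The exit-code rule for exec probes can be sketched as follows. The exec_probe helper is hypothetical and runs the command locally, whereas Kubernetes runs it inside the container; treating a timeout or a missing binary as a failure is a choice made for this example.

```python
import subprocess
from typing import Sequence


def exec_probe(command: Sequence[str], timeout_seconds: float = 1.0) -> bool:
    """Return True only if the command exits with status code 0 in time."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # The command hung, or the binary is missing or not executable.
        return False
    return result.returncode == 0
```

With the command from the example above, the call would be exec_probe(["pg_isready", "-U", "postgres"]), which returns False on any machine where pg_isready is not installed.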

gRPC Health Checking Probe

A specialized probe for containers that implement the gRPC Health Checking Protocol. It uses a standard gRPC health check request to a defined service.

Use Case: Native health checking for gRPC-based microservices. This is more efficient and accurate for gRPC than using a TCP socket probe.
Configuration: Requires port and optionally a service name. If no service is specified, it checks the overall health of the server.
Advantage: Provides service-level health granularity, allowing different services within the same gRPC server to report their status independently.
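As a rough illustration of the configuration described above, the sketch below builds the grpc section of a probe as a Python dictionary and prints it as JSON, a format Kubernetes manifests can also be written in. The port 9090 and the service name agent.Worker are placeholder assumptions.

```python
import json
from typing import Optional


def grpc_liveness_probe(port: int, service: Optional[str] = None) -> dict:
    """Build a gRPC probe; omitting `service` checks the server as a whole."""
    action = {"port": port}
    if service is not None:
        action["service"] = service
    return {"grpc": action}


# Hypothetical port and service name, for illustration only.
print(json.dumps({"livenessProbe": grpc_liveness_probe(9090, "agent.Worker")}, indent=2))
```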

Probe Timing Parameters

Beyond the type, liveness probes are governed by critical timing parameters that control their behavior and aggressiveness:

initialDelaySeconds: Wait time after container starts before probes begin. Crucial for apps with slow startup.
periodSeconds: How often to perform the probe (e.g., every 10 seconds).
timeoutSeconds: Time after which the probe is considered failed if no response.
successThreshold: Consecutive successes required after a failure for the probe to be considered successful again. Defaults to 1, and must be 1 for liveness and startup probes.
failureThreshold: Number of consecutive failures before the container is considered unhealthy and is restarted.

Misconfiguring these values is a common cause of unnecessary pod restarts or unresponsive containers.
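The sketch below puts the five timing parameters into one liveness probe definition and estimates how long a hung container could go undetected. The specific values, the /healthz endpoint, and the worst_case_detection_seconds helper are assumptions chosen for illustration. The estimate is approximate: it assumes the container hangs immediately after a successful probe and that the final failing probe runs until it times out.

```python
import json

# Illustrative values only; tune them to the application's observed behavior.
liveness_probe = {
    "httpGet": {"path": "/healthz", "port": 8080},
    "initialDelaySeconds": 15,
    "periodSeconds": 10,
    "timeoutSeconds": 2,
    "successThreshold": 1,
    "failureThreshold": 3,
}


def worst_case_detection_seconds(probe: dict) -> int:
    """Rough upper bound on the time between a hang and the restart decision."""
    return probe["periodSeconds"] * probe["failureThreshold"] + probe["timeoutSeconds"]


print(json.dumps({"livenessProbe": liveness_probe}, indent=2))
print(worst_case_detection_seconds(liveness_probe))  # 10 * 3 + 2 = 32
```

Lowering periodSeconds or failureThreshold shortens this window, at the cost of more probe traffic and a higher chance of restarting on a transient failure.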

Liveness vs. Readiness vs. Startup

Liveness probes are one of three health check types in Kubernetes, each with a distinct purpose:

Liveness Probe: Answers "Is the container alive?" Failure results in a container restart.
Readiness Probe: Answers "Is the container ready to serve traffic?" Failure removes the pod from service load balancers but does not restart it. Used for temporary unavailability (e.g., loading large cache).
Startup Probe: Answers "Has the application finished starting up?" Disables liveness/readiness checks until it succeeds once. Designed for legacy apps with lengthy, unpredictable startup times.

A robust deployment often uses a combination: a startup probe for initialization, then readiness for traffic management, and liveness as a last-resort restart mechanism.
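One way such a combination might look is sketched below as a container definition built in Python and printed as JSON. The container name, image, paths, and all numeric values are placeholders. The startup_budget_seconds helper is a hypothetical function that computes how long the startup probe gives the application before the container is killed.

```python
import json

# Placeholder container spec; names, image, and values are assumptions.
container = {
    "name": "agent",
    "image": "example.com/agent:1.0",
    "ports": [{"containerPort": 8080}],
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 30,
    },
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 3,
    },
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
}


def startup_budget_seconds(probe: dict) -> int:
    """Approximate time allowed for startup before the container is killed."""
    return (
        probe.get("initialDelaySeconds", 0)
        + probe.get("periodSeconds", 10) * probe.get("failureThreshold", 3)
    )


print(json.dumps(container, indent=2))
print(startup_budget_seconds(container["startupProbe"]))  # 0 + 10 * 30 = 300
```

With these values the application has about five minutes to start, after which the liveness probe takes over with a much shorter reaction time.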

KUBERNETES HEALTH CHECK COMPARISON

Liveness Probe vs. Readiness Probe vs. Startup Probe

A comparison of the three primary health check mechanisms used in Kubernetes to manage container lifecycle and traffic routing.

Feature	Liveness Probe	Readiness Probe	Startup Probe
Primary Purpose	Detects a deadlocked or unresponsive container.	Determines if a container is ready to serve traffic.	Guards slow-starting containers during initialization.
Probe Failure Action	The kubelet kills the container, and it is restarted per its restart policy.	The kubelet marks the pod as not ready, and the pod's IP is removed from the endpoints of all matching Services; traffic is not routed to it.	If it fails, the kubelet kills the container, which is then restarted per its restart policy.
Probe Success Action	No action; container continues running.	Pod is marked as Ready and added to Service endpoints.	The kubelet begins the liveness and readiness probes.
Typical Use Case	Application is running but stuck (e.g., deadlock).	Application is booting, loading large data, or temporarily overloaded.	Legacy applications with startup times exceeding initialDelaySeconds.
Impact on Traffic	Does not directly affect traffic routing.	Directly controls traffic routing via Service endpoints.	No impact; traffic is not routed until the startup probe succeeds and readiness probe passes.
Default Configuration	None; must be explicitly defined.	None; must be explicitly defined.	None; must be explicitly defined.
Common Check Types	HTTP GET, TCP Socket, Exec command, gRPC.	HTTP GET, TCP Socket, Exec command, gRPC.	HTTP GET, TCP Socket, Exec command, gRPC.
Probe Timing	Runs continuously for the pod's entire lifetime after startup.	Runs continuously for the pod's entire lifetime after startup.	Runs only during the pod's initialization phase, before liveness/readiness begin.

LIVENESS PROBE

Key Configuration Parameters

A liveness probe is a health check mechanism used by container orchestrators like Kubernetes to determine if a container is still running and responsive. Its configuration dictates how, when, and under what conditions the system will attempt to restart an unresponsive application instance.

Probe Type

Defines the method used to check the container's liveness. The three most common types are listed below; Kubernetes also supports a gRPC probe for services that implement the gRPC Health Checking Protocol.

HTTP GET Probe: Sends an HTTP request to a specified path and port. A successful response (HTTP status code between 200 and 399) indicates the container is alive.
TCP Socket Probe: Attempts to open a TCP connection to a specified port. A successfully established connection indicates the container is alive.
Exec Probe: Executes a specified command inside the container. A zero exit code from the command indicates success.

The choice depends on the application's architecture; a web service typically uses an HTTP GET probe, while a database might use a TCP socket probe.

Initial Delay Seconds

The number of seconds to wait after the container starts before initiating the first liveness probe. This is a critical parameter for applications with slow initialization times (e.g., legacy monoliths, JVM-based services). Setting this too low will cause the orchestrator to kill the container before it has finished starting up, leading to a restart loop. A best practice is to set this value slightly higher than the worst-case startup time observed during testing.

Period Seconds

Specifies how often (in seconds) to perform the liveness probe after the initial delay. The Kubernetes default is 10 seconds. A shorter period detects failures more quickly but increases overhead. A longer period reduces overhead but lets an unhealthy container keep running, and possibly keep receiving traffic, for longer before it is restarted. The value should balance responsiveness with system load.

Timeout Seconds

The number of seconds after which the probe times out; the Kubernetes default is 1 second. If the probe does not complete within this window, it is recorded as a failure. Kubernetes does not enforce it, but keeping this value lower than periodSeconds is a sensible convention. For HTTP GET probes, this is the maximum time to wait for an HTTP response. A timeout typically indicates the application is hung or severely degraded, not just slow.

Success & Failure Thresholds

These parameters introduce hysteresis, preventing flapping due to transient issues.

Success Threshold: The minimum number of consecutive successful probes required for a container that has previously failed to be considered healthy again. For liveness probes, Kubernetes requires this value to be 1.
Failure Threshold: The number of consecutive probe failures required for the container to be considered unhealthy. A value greater than 1 (e.g., 3) allows the container to survive transient network glitches or brief garbage collection pauses without being restarted.
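The effect of the failure threshold can be illustrated with a small counter. This is a simplified model of the behavior described above, not the kubelet's code; the LivenessTracker class is hypothetical.

```python
class LivenessTracker:
    """Counts consecutive probe failures and signals when to restart."""

    def __init__(self, failure_threshold: int = 3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_succeeded: bool) -> bool:
        """Record one probe result; return True if a restart is due."""
        if probe_succeeded:
            # A single success resets the streak (successThreshold is 1).
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.consecutive_failures = 0  # the new container starts from zero
            return True
        return False


tracker = LivenessTracker(failure_threshold=3)
results = [False, False, True, False, False, False]
print([tracker.record(ok) for ok in results])
# [False, False, False, False, False, True]
```

The two failures at the start do not cause a restart because the success that follows resets the count; only the three consecutive failures at the end do.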

Endpoint Design & Idempotency

The endpoint or command targeted by the probe must be carefully designed. It should:

Perform a minimal, internal-state check (e.g., check a local cache, verify a thread pool is responsive).
Avoid complex logic or calls to downstream dependencies (databases, external APIs), as their failure does not necessarily mean the container itself is dead.
Be idempotent and have no side effects. A liveness probe should not change application state or trigger business logic.

A poorly designed endpoint can lead to false positives and unnecessary restarts.
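One possible shape for such a minimal, internal-state check is a heartbeat that the application's main worker loop updates and the liveness endpoint reads. The Heartbeat class and its max_age_seconds parameter are hypothetical; the injectable clock exists only to make the sketch easy to test.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks when the worker loop last made progress."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the worker loop after each unit of work."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """Read-only check for the liveness endpoint; no downstream calls."""
        return self._clock() - self._last_beat <= self._max_age
```

The liveness handler would return a 2xx status when is_alive() is True and a 5xx status otherwise. The check has no side effects and touches no database or external API, so a downstream outage alone does not trigger a restart.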

KUBERNETES HEALTH CHECKS

Frequently Asked Questions

A liveness probe is a fundamental health check mechanism in container orchestration platforms like Kubernetes. It ensures application availability by automatically restarting unresponsive containers. This FAQ addresses its operation, configuration, and role in production observability.

A liveness probe is a diagnostic check performed by a container orchestrator (like Kubernetes) to determine if a containerized application is still running and responsive. If the probe fails, the orchestrator terminates the container and restarts it according to its restart policy.

It works by periodically executing one of three types of checks against a target container:

HTTP GET Probe: Sends an HTTP request to a specified path and port. A success is typically a status code between 200 and 399.
TCP Socket Probe: Attempts to open a TCP connection to a specified port. Success is establishing a connection.
Exec Probe: Executes a specified command inside the container. A zero exit code indicates success.

The orchestrator uses the results to make lifecycle decisions, ensuring faulty containers are automatically healed.

AGENT DEPLOYMENT OBSERVABILITY

Related Terms

Liveness probes are a core component of container health monitoring. Understanding related deployment and observability concepts is essential for managing reliable, autonomous agent systems.

Readiness Probe

A Kubernetes health check that determines if a container is fully initialized and ready to accept network traffic. Unlike a liveness probe, which restarts a non-responsive container, a failed readiness probe removes the pod's IP address from the service's endpoint list, stopping new traffic from being routed to it.

Purpose: Signal that the application has completed its startup sequence (e.g., loaded caches, opened database connections).
Failure Consequence: The pod is marked as "Not Ready" and is taken out of the load balancer pool.
Typical Commands: HTTP GET on a /ready endpoint, TCP socket check, or execution of a custom command script.

Startup Probe

A Kubernetes health check used for legacy applications or services with exceptionally long startup times. It disables the activity of liveness and readiness probes until the startup probe succeeds, preventing the orchestrator from killing the container before it has had a chance to fully start.

Purpose: Protect slow-starting containers from being restarted prematurely.
Failure Consequence: The container is killed and restarted according to its restart policy.
Use Case: Java applications with large heaps, monolithic legacy systems initializing many components.

Health Check (Generic)

A periodic diagnostic test performed by an orchestrator or monitoring system to verify an application instance is functioning correctly. In Kubernetes, this is implemented via the three probe types (liveness, readiness, startup). In broader cloud-native architectures, health checks are used by load balancers and service meshes to determine instance viability.

Core Mechanism: Active polling or passive analysis of application state.
Key Attributes: Check interval, timeout period, success/failure thresholds.
Broader Context: Part of the Service Level Indicator (SLI) definition for system availability.

Graceful Shutdown

The orderly termination process for an application, allowing it to complete in-flight requests, release resources, and persist state before the container runtime forces it to stop. In Kubernetes, the kubelet first runs the container's preStop lifecycle hook (if one is defined) and then sends SIGTERM; if the container is still running when the termination grace period (terminationGracePeriodSeconds, 30 seconds by default) expires, it is sent SIGKILL.

Contrast with Liveness: Liveness ensures the app is alive; graceful shutdown ensures it dies cleanly.
Importance for Agents: Critical for agent state monitoring to prevent data loss or corruption when pods are rescheduled.
Kubernetes Hook: lifecycle.preStop
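On the application side, graceful shutdown usually means reacting to SIGTERM. The sketch below shows one common pattern under simple assumptions: a single worker loop, and hypothetical names (handle_sigterm, install_handler, run) chosen for this example.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only sets a flag; the loop does the actual cleanup."""
    shutdown_requested.set()


def install_handler() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(do_one_unit_of_work, idle_seconds: float = 1.0) -> None:
    """Process work until shutdown is requested, finishing the current unit."""
    while not shutdown_requested.is_set():
        do_one_unit_of_work()
        # Sleeps between units but wakes immediately on shutdown.
        shutdown_requested.wait(idle_seconds)
```

After run returns, the application can flush state and exit before the grace period ends, which avoids being stopped by SIGKILL.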

Circuit Breaker

A resiliency pattern that prevents an application from repeatedly attempting to call a failing service. If failures exceed a threshold, the circuit "opens," and requests fail fast for a period, allowing the downstream service time to recover. This is a complementary pattern to health checks.

Relation to Probes: While a liveness probe restarts a local container, a circuit breaker protects your service from remote failures.
Implementation: Common in service meshes (e.g., Istio) and client libraries (e.g., Resilience4j).
States: Closed (normal operation), Open (failing fast), Half-Open (testing for recovery).
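The three states can be modeled with a small state machine. This is a simplified sketch and not the behavior of any particular library; the CircuitBreaker class, its parameters, and the injectable clock are assumptions for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal closed / open / half-open circuit breaker."""

    def __init__(self, failure_threshold: int = 3,
                 reset_timeout_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial request
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # back to closed

    def record_failure(self) -> None:
        if self.state == "half-open":
            self._opened_at = self._clock()  # trial failed: reopen
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller checks allow_request() before each remote call and reports the outcome with record_success() or record_failure().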

Pod Disruption Budget (PDB)

A Kubernetes policy that limits the number of concurrent voluntary disruptions to pods belonging to an application, ensuring high availability during cluster maintenance operations like node drains or updates. It complements liveness probes: probes recover unhealthy containers, while the PDB limits how many pods planned maintenance can take down at once.

Voluntary vs. Involuntary: PDBs govern voluntary disruptions (initiated by a user or cluster administrator, such as a node drain). They do not limit involuntary disruptions such as node failures, and they do not prevent liveness-triggered container restarts.
Key Parameters: minAvailable or maxUnavailable pods.
Use Case for Agents: Ensures a minimum number of autonomous agent replicas remain running during cluster operations, preserving system capability.
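As an illustration of the parameters above, the sketch below builds a PodDisruptionBudget manifest in Python and prints it as JSON. The name agent-pdb, the app: agent label, and the minAvailable value are placeholder assumptions.

```python
import json

# Placeholder manifest; name, label, and replica count are assumptions.
pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "agent-pdb"},
    "spec": {
        # Set either minAvailable or maxUnavailable, not both.
        "minAvailable": 2,
        "selector": {"matchLabels": {"app": "agent"}},
    },
}

print(json.dumps(pdb, indent=2))
```

With this budget, a node drain that would leave fewer than two matching pods available is blocked until replacement pods are running elsewhere.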

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
