A liveness probe is a Kubernetes health check mechanism that determines if a container is running and responsive, triggering an automatic restart if the probe fails. It is a fundamental component of self-healing software systems, allowing a container orchestrator to detect and remediate a hung or dead process without human intervention. Probes are typically configured as HTTP requests, TCP socket checks, or command executions within the container.

What is a Liveness Probe?
A core mechanism for ensuring containerized applications remain responsive and can self-heal from runtime failures.
The kubelet periodically runs a diagnostic check against the container, for example a request to a defined health endpoint. If consecutive failures reach a configured threshold, the kubelet terminates the container and restarts it according to the pod's restart policy. This is distinct from a readiness probe, which controls traffic flow: a liveness probe governs the container's lifecycle, maintaining application availability within the broader recursive error correction framework.
Key Features of a Liveness Probe
A Liveness Probe is a Kubernetes health check mechanism that determines if a container is running and responsive. It is a core component of resilient, self-healing application deployment, automatically restarting containers that fail the probe.
Core Purpose: Detect Unresponsive Containers
The primary function of a liveness probe is to detect when an application inside a container has entered a broken state—such as a deadlock, infinite loop, or internal crash—where it is still running but cannot make progress or serve requests. Unlike a readiness probe, which checks if a container is ready to serve, a liveness probe checks if it should be restarted. A failed probe triggers the kubelet to kill the container, and the Pod's restartPolicy (usually Always) initiates a restart, aiming to restore service automatically.
Probe Types & Configuration
Liveness probes are most commonly configured using one of three handlers, defined in the container's spec within the Pod manifest (Kubernetes also offers a gRPC handler, which is not covered here):
- HTTP GET Probe: The kubelet sends an HTTP GET request to a specified path and port. A success is any HTTP status code between 200 and 399. This is ideal for web services and APIs.
- TCP Socket Probe: The kubelet attempts to open a TCP connection to a specified port. Success is established if a connection can be made. Used for non-HTTP services like databases or custom TCP protocols.
- Exec Probe: The kubelet executes a specified command inside the container. The probe succeeds if the command exits with status code 0. This allows for custom, application-specific health logic.
Key configuration parameters include initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold (which must be 1 for liveness probes), and failureThreshold.
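The success conditions of the three handlers can be modeled in a few lines of Python. This is only a sketch of the rules described above, not how the kubelet is implemented; the function names and default timeouts are assumptions made for illustration.

```python
import socket
import subprocess
import sys
from typing import Sequence


def http_status_ok(status_code: int) -> bool:
    """HTTP GET handler: any status from 200 through 399 counts as success."""
    return 200 <= status_code < 400


def tcp_check(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """TCP socket handler: success means a connection could be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


def exec_check(command: Sequence[str], timeout_seconds: float = 1.0) -> bool:
    """Exec handler: success means the command exited with status 0 in time."""
    try:
        completed = subprocess.run(
            list(command),
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0
```

In each case a timeout is treated as a failure, which mirrors the role of timeoutSeconds.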
Integration with Pod Lifecycle
The liveness probe operates within the broader Pod lifecycle. Probing begins after the optional initialDelaySeconds, which gives the application time to bootstrap, and then repeats every periodSeconds. A single failure does not immediately restart the container; the probe must fail failureThreshold consecutive times, which prevents unnecessary restarts from transient issues. Once that threshold is reached, the kubelet kills the container and the Pod's restartPolicy governs the restart. If the container keeps failing after each restart, Kubernetes delays successive restarts with an increasing back-off, and the Pod is reported in the CrashLoopBackOff state.
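The consecutive-failure rule can be illustrated with a small simulation. The function below is a hypothetical helper, not part of any Kubernetes API; it takes a sequence of probe outcomes and reports which probe run would trigger a restart.

```python
from typing import Iterable, Optional


def probe_that_triggers_restart(
    results: Iterable[bool], failure_threshold: int = 3
) -> Optional[int]:
    """Return the 0-based index of the probe run that triggers a restart.

    `results` holds one boolean per probe run (True means success). Any
    success resets the consecutive-failure counter. Returns None if the
    threshold is never reached.
    """
    consecutive_failures = 0
    for index, succeeded in enumerate(results):
        if succeeded:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            return index
    return None


# Isolated failures between successes never reach the threshold.
print(probe_that_triggers_restart([True, False, True, False, True]))  # None
# Three failures in a row do; the third one is at index 4.
print(probe_that_triggers_restart([True, True, False, False, False, True]))  # 4
```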
Distinction from Readiness & Startup Probes
It is critical to distinguish liveness from other Kubernetes health checks:
- vs. Readiness Probe: A readiness probe determines if a container is ready to accept traffic. A failed readiness probe removes the Pod's IP from Service endpoints but does not restart the container. Use it for slow startups or temporary dependencies.
- vs. Startup Probe: Used for legacy applications with long initialization times. It disables liveness and readiness checks until it succeeds once. After that, liveness probes take over for the remainder of the container's lifecycle.
A common pattern: Use a startup probe for initial boot, a readiness probe for traffic management, and a liveness probe for crash recovery.
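As a rough sketch of this pattern, the script below builds a Pod manifest with all three probes and prints it as JSON, a format Kubernetes accepts alongside YAML. The Pod name, image, port 8080, and the /healthz and /ready paths are placeholders chosen for illustration, and the timing values are examples rather than recommendations.

```python
import json

# Hypothetical values: the name, image, port, and paths are placeholders.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "probe-demo"},
    "spec": {
        "restartPolicy": "Always",
        "containers": [{
            "name": "app",
            "image": "registry.example.com/app:1.0",
            "ports": [{"containerPort": 8080}],
            # Startup probe: the other probes are held off until it succeeds
            # once. Allows up to 30 * 10 = 300 seconds for initialization.
            "startupProbe": {
                "httpGet": {"path": "/healthz", "port": 8080},
                "periodSeconds": 10,
                "failureThreshold": 30,
            },
            # Readiness probe: failure removes the Pod from Service endpoints.
            "readinessProbe": {
                "httpGet": {"path": "/ready", "port": 8080},
                "periodSeconds": 5,
                "failureThreshold": 3,
            },
            # Liveness probe: repeated failure makes the kubelet restart
            # the container.
            "livenessProbe": {
                "httpGet": {"path": "/healthz", "port": 8080},
                "periodSeconds": 10,
                "timeoutSeconds": 1,
                "failureThreshold": 3,
            },
        }],
    },
}

manifest_json = json.dumps(pod_manifest, indent=2)
print(manifest_json)
```

The printed JSON could be saved to a file and applied with kubectl, after substituting a real image and endpoints.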
Design Best Practices & Anti-Patterns
Effective liveness probe design is crucial for system stability.
Best Practices:
- The check should be lightweight and fast, with a low timeoutSeconds.
- The endpoint or command should be internal and not depend on external dependencies (e.g., databases, downstream APIs).
- Use a dedicated, low-privilege health endpoint for HTTP probes.
- Set initialDelaySeconds appropriately to avoid killing slow-starting apps.
Anti-Patterns to Avoid:
- Leaky Abstractions: A probe that fails due to a downstream database outage could cause unnecessary restarts of otherwise healthy application containers.
- Overly Sensitive Probes: Setting a low failureThreshold or short periodSeconds can cause restart storms.
- Heavy Computational Logic: An exec probe that runs a complex script can consume significant CPU, affecting application performance.
Role in Self-Healing Systems
The liveness probe is a foundational reactive mechanism for self-healing software. It enables an application to automatically recover from certain internal software faults without human operator intervention, increasing overall system availability. This aligns with the Recursive Error Correction pillar by providing a basic, automated corrective action (restart) upon detecting a failure state. For more complex autonomous agents, liveness probes act as a circuit breaker at the container level, preventing a single faulty agent process from stalling an entire system. They are a key primitive in building fault-tolerant and resilient distributed systems where manual recovery is impractical at scale.
How a Liveness Probe Works
A liveness probe is a Kubernetes health check mechanism that determines if a container is running and responsive, triggering a restart if the probe fails.
A liveness probe is a periodic diagnostic executed by the kubelet agent on a Kubernetes node. It performs a configurable check (an HTTP GET request, a TCP socket connection, or a command execution inside the container) to detect whether the application is still responsive or has become stuck even though its process is running. If the probe fails consecutively and reaches the defined failure threshold, the kubelet kills the container and restarts it according to the pod's restartPolicy. This mechanism is a core self-healing capability in container orchestration, ensuring faulty instances are automatically recovered.
Probes are defined in a container's specification within the pod manifest. Key parameters include initialDelaySeconds (wait time before starting probes), periodSeconds (time between probes), timeoutSeconds, successThreshold, and failureThreshold. Unlike a readiness probe, which controls traffic flow, a liveness probe governs container lifecycle. It is a foundational pattern for building resilient, fault-tolerant services, acting as an automated dead man's switch for containerized processes. Misconfiguration, such as overly sensitive checks, can cause unnecessary restart loops.
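These parameters together determine how long a hung container can go unnoticed. The sketch below gives a back-of-the-envelope estimate under simplifying assumptions: it ignores kubelet scheduling jitter and the time taken to stop the container, and the class and method names are hypothetical. The default values mirror the documented Kubernetes defaults for periodSeconds (10), timeoutSeconds (1), and failureThreshold (3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3

    def best_case_seconds(self) -> int:
        """Hang starts just before a probe runs and every probe fails at once."""
        return (self.failure_threshold - 1) * self.period_seconds

    def worst_case_seconds(self) -> int:
        """Hang starts just after a success and each probe runs to its timeout."""
        return self.failure_threshold * self.period_seconds + self.timeout_seconds


timing = ProbeTiming()
print(timing.best_case_seconds(), timing.worst_case_seconds())  # 20 31
```

Lowering periodSeconds or failureThreshold shortens this window, at the cost of the restart-loop risk mentioned above.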
Liveness Probe Types: Comparison
A comparison of the three primary mechanisms for implementing a Kubernetes Liveness Probe, detailing their operation, configuration, and trade-offs.
| Probe Type | HTTP GET | TCP Socket | Exec Command |
|---|---|---|---|
| Core Mechanism | Issues an HTTP request to a specified endpoint | Attempts to open a TCP connection to a specified port | Executes a command inside the container |
| Success Condition | HTTP status code between 200 and 399 | TCP connection is successfully established | Command exits with status code 0 |
| Primary Use Case | Web servers, REST APIs, HTTP services | Non-HTTP services (e.g., databases, custom TCP protocols) | Custom, complex health logic not expressible via HTTP/TCP |
| Configuration Complexity | Low (requires endpoint path/port) | Low (requires port number) | High (requires crafting and securing a shell command) |
| Resource Overhead | Low (single HTTP request) | Very Low (port check) | Variable to High (depends on command; can be CPU/memory intensive) |
| Security Consideration | Endpoint should be internal/unprivileged | Port should be internal/firewalled | High risk; command runs with container privileges; avoid shell injection |
| Failure Granularity | Specific HTTP error code may be returned | Binary (connection succeeds or fails) | Custom exit code and stderr output available for debugging |
| Recommended Initial Delay | 5-30 seconds | 5-30 seconds | 30+ seconds (if command is resource-heavy) |
Frequently Asked Questions
A liveness probe is a core Kubernetes health check mechanism that determines if a container is running and responsive. This section answers common technical questions about its configuration, behavior, and role in resilient system design.
A liveness probe is a Kubernetes health check that determines if a container is running and responsive, triggering a restart if the probe fails. It is a diagnostic mechanism that periodically executes a test (an HTTP GET request, a TCP socket connection, or a command execution inside the container) to assess the application's basic operational state. Unlike a readiness probe, which gates traffic, a liveness probe's sole purpose is to identify and recover from a "dead" or unresponsive container by having the kubelet kill and restart that container; the Pod itself is not recreated. This automated recovery is a foundational pattern for building self-healing, resilient applications within a container orchestration platform.
Related Terms
A liveness probe is a foundational health check in container orchestration. The following terms represent related concepts in the broader ecosystem of automated diagnostics and resilient system design.
Circuit Breaker
A software design pattern that detects failures and prevents an application from repeatedly trying to execute an operation that's likely to fail. It acts as a proxy for operations that can fail, moving between Closed, Open, and Half-Open states. This pattern provides stability and prevents cascading failures in distributed systems, complementing health checks by offering fast failure rather than waiting for a timeout.
- Closed State: Requests flow normally; failures are counted.
- Open State: Requests fail immediately without attempting the operation.
- Half-Open State: A limited number of test requests are allowed to see if the underlying fault is resolved.
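The three states can be sketched as a small state machine. This is a minimal single-threaded illustration, not a production implementation: the class name, parameters, and the choice to count consecutive failures are assumptions, and the injectable clock exists only to make the behavior easy to demonstrate.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None: circuit is not open

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; operation not attempted")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial while half-open, or too many consecutive
            # failures while closed, opens the circuit (again).
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each risky operation in `breaker.call(...)` and treats `CircuitOpenError` as an immediate failure instead of waiting for a timeout.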
Dead Man's Switch
A safety mechanism that requires a periodic signal or 'heartbeat' to confirm a system or process is operational. If the expected signal is not received within a defined timeout, the system assumes a failure and triggers a corrective action, such as a failover, shutdown, or alert. This is a broader conceptual analog to a liveness probe, often implemented at the application or infrastructure level rather than the container level.
- Mechanism: Periodic 'I am alive' signals from the monitored entity.
- Corrective Action: Executes a predefined safety procedure (e.g., restart, notify, switch to backup).
- Example: A cloud VM sending heartbeats to a monitoring service; missing heartbeats trigger an auto-scaling group replacement.
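A heartbeat-based switch of this kind might look like the following sketch. The class and method names are hypothetical, and the monitor is assumed to call `check()` on its own schedule; the clock is injectable so the timeout logic can be exercised without waiting.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Runs `on_missed` once if no heartbeat arrives within `timeout` seconds."""

    def __init__(
        self,
        timeout: float,
        on_missed: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout = timeout
        self._on_missed = on_missed
        self._clock = clock
        self._last_beat = clock()
        self._fired = False

    def heartbeat(self) -> None:
        """Called by the monitored process to signal 'I am alive'."""
        self._last_beat = self._clock()
        self._fired = False

    def check(self) -> bool:
        """Called periodically by the monitor; returns True while healthy."""
        if self._clock() - self._last_beat <= self.timeout:
            return True
        if not self._fired:
            # Trigger the corrective action only once per missed window.
            self._fired = True
            self._on_missed()
        return False
```

The corrective action passed as `on_missed` could be a restart, a notification, or a switch to a backup, as listed above.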
Health Endpoint
A dedicated URL (e.g., /health or /status) exposed by a service that returns a standardized HTTP status code and payload indicating its operational health. This endpoint is the target for probes from orchestrators like Kubernetes, load balancers, and monitoring tools. A robust health endpoint performs dependency checks (database, APIs) and returns detailed component status. When the same service also backs a liveness probe, those dependency checks are better kept out of the liveness path, since a downstream outage would otherwise restart healthy containers.
- Standard Response: HTTP 200 OK for healthy, 5xx for unhealthy.
- Payload: Often JSON detailing status of subcomponents (e.g., {"db": "ok", "cache": "degraded"}).
- Implementation: Can check internal state, connection pools, and free disk space.
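A health endpoint along these lines can be sketched with Python's standard library. The /livez path, the `make_health_server` helper, and the shape of the `checks` mapping are assumptions for illustration; /health returns the per-component report with 200 or 503, while /livez answers without consulting any dependency.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict

# Hypothetical component checks; each returns "ok" or another status string.
ComponentChecks = Dict[str, Callable[[], str]]


def make_health_server(
    checks: ComponentChecks, port: int = 8080, host: str = "0.0.0.0"
) -> HTTPServer:
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/livez":
                # Liveness: the process can answer at all; no dependencies.
                self._send(200, {"status": "ok"})
            elif self.path == "/health":
                report = {name: check() for name, check in checks.items()}
                healthy = all(status == "ok" for status in report.values())
                self._send(200 if healthy else 503, report)
            else:
                self._send(404, {"error": "not found"})

        def _send(self, code: int, payload: dict) -> None:
            body = json.dumps(payload).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format: str, *args) -> None:
            pass  # keep probe traffic out of stderr

    return HTTPServer((host, port), HealthHandler)
```

Calling `serve_forever()` on the returned server starts handling requests; in a real service this would typically run in a background thread next to the main workload.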
Watchdog Timer
A hardware or software timer that must be periodically reset by a main program to prove it is not stuck in a hang or infinite loop. If the timer expires (is not 'petted'), it triggers a system reset or a predefined recovery action. This is a low-level, time-based fault detection mechanism, analogous to a liveness probe but typically operating at the OS or firmware level to recover from catastrophic stalls.
- Implementation: Can be a hardware chip or a kernel daemon.
- Reset Action: Often called 'kicking' or 'petting' the watchdog.
- Use Case: Critical embedded systems, IoT devices, and servers where unresponsive states must be automatically cleared.
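A software watchdog can be approximated in user space with a timer that is restarted on every pet. The sketch below uses Python's `threading.Timer`; the class name and the callback-based recovery action are assumptions, and a hardware or kernel watchdog would reset the system rather than call a function.

```python
import threading
from typing import Callable, Optional


class Watchdog:
    """Software watchdog: runs `on_expire` unless pet() is called in time."""

    def __init__(self, timeout: float, on_expire: Callable[[], None]) -> None:
        self.timeout = timeout
        self._on_expire = on_expire
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def start(self) -> None:
        """Begin the first countdown."""
        self.pet()

    def pet(self) -> None:
        """Reset the countdown ('kicking' or 'petting' the watchdog)."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self._on_expire)
            self._timer.daemon = True
            self._timer.start()

    def stop(self) -> None:
        """Cancel the countdown without triggering the recovery action."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

The main loop would call `pet()` once per iteration; if the loop hangs for longer than `timeout`, the recovery action runs on the timer thread.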

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.