Inferensys

Glossary

Liveness Probe

A Kubernetes health check that determines if a container is running and responsive, triggering a restart if the probe fails.
AGENTIC HEALTH CHECKS

What is a Liveness Probe?

A core mechanism for ensuring containerized applications remain responsive and can self-heal from runtime failures.

A liveness probe is a Kubernetes health check mechanism that determines if a container is running and responsive, triggering an automatic restart if the probe fails. It is a fundamental component of self-healing software systems, allowing a container orchestrator to detect and remediate a hung or dead process without human intervention. Probes are typically configured as HTTP requests, TCP socket checks, or command executions within the container.

The probe periodically runs a diagnostic check against the container. If consecutive failures reach the configured threshold, the kubelet terminates the container and restarts it according to the pod's restart policy. A readiness probe, by contrast, only controls whether the container receives traffic. A liveness probe governs the container's lifecycle, which makes the restart it triggers the basic corrective action within the broader recursive error correction framework.

KUBERNETES HEALTH CHECK

Key Features of a Liveness Probe

A Liveness Probe is a Kubernetes health check mechanism that determines if a container is running and responsive. It is a core component of resilient, self-healing application deployment, automatically restarting containers that fail the probe.

01

Core Purpose: Detect Unresponsive Containers

The primary function of a liveness probe is to detect when an application inside a container has entered a broken state, such as a deadlock or an infinite loop, in which the process is still running but cannot make progress or serve requests. Unlike a readiness probe, which checks whether a container is ready to serve traffic, a liveness probe checks whether the container should be restarted. Once the probe has failed failureThreshold consecutive times, the kubelet kills the container, and the Pod's restartPolicy (Always by default) initiates a restart, aiming to restore service automatically.

02

Probe Types & Configuration

Liveness probes can be configured using one of three handlers, defined in the container's spec within the Pod manifest:

  • HTTP GET Probe: The kubelet sends an HTTP GET request to a specified path and port. A success is any HTTP status code between 200 and 399. This is ideal for web services and APIs.
  • TCP Socket Probe: The kubelet attempts to open a TCP connection to a specified port. Success is established if a connection can be made. Used for non-HTTP services like databases or custom TCP protocols.
  • Exec Probe: The kubelet executes a specified command inside the container. The probe succeeds if the command exits with status code 0. This allows for custom, application-specific health logic.

Key configuration parameters include initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, and failureThreshold. For liveness probes, successThreshold must be 1.
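
The parameters above can be seen together in a small sketch. It builds a Pod manifest as a Python dictionary and renders it as JSON, which kubectl accepts alongside YAML. The Pod name, image, path, port, and timing values are hypothetical placeholders rather than recommendations.

```python
import json

# Hypothetical values throughout: the path, port, and timings are
# illustrative and should be tuned for the real workload.
liveness_probe = {
    "httpGet": {"path": "/healthz", "port": 8080},
    "initialDelaySeconds": 10,  # wait before the first probe
    "periodSeconds": 10,        # interval between probes
    "timeoutSeconds": 1,        # per-probe timeout
    "successThreshold": 1,      # must be 1 for liveness probes
    "failureThreshold": 3,      # consecutive failures before a restart
}

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "liveness-demo"},  # hypothetical name
    "spec": {
        "restartPolicy": "Always",
        "containers": [
            {
                "name": "app",
                "image": "example.com/app:1.0",  # hypothetical image
                "livenessProbe": liveness_probe,
            }
        ],
    },
}


def render_manifest(manifest: dict) -> str:
    """Serialize the manifest as indented JSON."""
    return json.dumps(manifest, indent=2)


if __name__ == "__main__":
    print(render_manifest(pod_manifest))
```

Swapping the httpGet key for tcpSocket (with a port) or exec (with a command list) selects one of the other two handlers while leaving the timing parameters unchanged.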

03

Integration with Pod Lifecycle

The liveness probe operates within the broader Pod lifecycle. It typically starts after an optional initialDelaySeconds, allowing the application time to bootstrap. Once active, it runs periodically based on periodSeconds. A single failure does not immediately restart the container; the probe must fail failureThreshold consecutive times. This prevents unnecessary restarts from transient issues. Upon consecutive failures, the kubelet kills the container. The Pod's restartPolicy then governs the restart. If restarts continue rapidly (controlled by Kubernetes back-off logic), the Pod may enter a CrashLoopBackOff state.
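
The consecutive-failure rule can be illustrated with a simplified model. This is a sketch of the counting logic only, not the kubelet's implementation; the function names are hypothetical, and the timing helper ignores timeouts and scheduling jitter.

```python
from typing import Iterable, Optional


def first_restart_index(results: Iterable[bool], failure_threshold: int = 3) -> Optional[int]:
    """Return the index of the probe result that triggers a restart, or None.

    `results` holds probe outcomes in order (True = success). A restart is
    triggered once `failure_threshold` failures occur in a row; any success
    resets the counter.
    """
    consecutive_failures = 0
    for index, ok in enumerate(results):
        if ok:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            return index
    return None


def seconds_until_probe(initial_delay: int, period: int, probe_index: int) -> int:
    """Approximate start time of the 0-based probe `probe_index`."""
    return initial_delay + probe_index * period


if __name__ == "__main__":
    outcomes = [True, False, True, False, False, False]
    index = first_restart_index(outcomes)
    print(index)                               # 5
    print(seconds_until_probe(10, 10, index))  # 60
```

In this run the isolated failure at index 1 is absorbed because the next probe succeeds, and only the three failures in a row at the end cause a restart.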

04

Distinction from Readiness & Startup Probes

It is critical to distinguish liveness from other Kubernetes health checks:

  • vs. Readiness Probe: A readiness probe determines if a container is ready to accept traffic. A failed readiness probe removes the Pod's IP from Service endpoints but does not restart the container. Use it for slow startups or temporary dependencies.
  • vs. Startup Probe: Used for legacy applications with long initialization times. It disables liveness and readiness checks until it succeeds once. After that, liveness probes take over for the remainder of the container's lifecycle.

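A common pattern: Use a startup probe for initial boot, a readiness probe for traffic management, and a liveness probe for recovering from hangs and deadlocks. A process that exits outright is already restarted under the Pod's restartPolicy, with or without a probe.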
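
As a rough sketch of that pattern, the three probes might be configured side by side as below. The paths, port, and thresholds are hypothetical. The helper estimates how long a startup probe allows the application to boot before the container is killed, which is approximately failureThreshold multiplied by periodSeconds.

```python
# Hypothetical probe settings for one container; values are illustrative.
probes = {
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 30,
    },
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 3,
    },
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
}


def startup_budget_seconds(startup_probe: dict) -> int:
    """Approximate time the application has to start (ignores initial delay)."""
    return startup_probe["failureThreshold"] * startup_probe["periodSeconds"]


if __name__ == "__main__":
    print(startup_budget_seconds(probes["startupProbe"]))  # 300
```

A generous startup budget lets the liveness probe keep a small failureThreshold, so hangs are still detected quickly once the application is running.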

05

Design Best Practices & Anti-Patterns

Effective liveness probe design is crucial for system stability.

Best Practices:

  • The check should be lightweight and fast, with a low timeoutSeconds.
  • The endpoint or command should be internal and not depend on external dependencies (e.g., databases, downstream APIs).
  • Use a dedicated, low-privilege health endpoint for HTTP probes.
  • Set initialDelaySeconds appropriately to avoid killing slow-starting apps.

Anti-Patterns to Avoid:

  • Leaky Abstractions: A probe that fails due to a downstream database outage could cause unnecessary restarts of otherwise healthy application containers.
  • Overly Sensitive Probes: Setting a low failureThreshold or short periodSeconds can cause restart storms.
  • Heavy Computational Logic: An exec probe that runs a complex script can consume significant CPU, affecting application performance.
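
A lightweight, dependency-free health endpoint along the lines of these practices could look like the following sketch, written with Python's standard http.server module. The /healthz path and port 8080 are assumptions; the handler deliberately avoids calling databases or downstream services.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200 without touching external dependencies."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request logging so frequent probes do not flood logs.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    """Create the health server; port 8080 is an assumed default."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

For the result to be meaningful, the endpoint should be served from the same process as the application, so that a hung application also stops answering the probe.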

06

Role in Self-Healing Systems

The liveness probe is a foundational reactive mechanism for self-healing software. It enables an application to automatically recover from certain internal software faults without human operator intervention, increasing overall system availability. This aligns with the Recursive Error Correction pillar by providing a basic, automated corrective action (restart) upon detecting a failure state. For more complex autonomous agents, liveness probes act as a container-level safeguard, comparable to a circuit breaker, that prevents a single faulty agent process from stalling an entire system. They are a key primitive in building fault-tolerant and resilient distributed systems where manual recovery is impractical at scale.

AGENTIC HEALTH CHECKS

How a Liveness Probe Works

A liveness probe is a Kubernetes health check mechanism that determines if a container is running and responsive, triggering a restart if the probe fails.

A liveness probe is a periodic diagnostic executed by the kubelet agent on a Kubernetes node. It performs a configurable check (an HTTP GET request, a TCP socket connection, or a command execution inside the container) to detect whether the primary application process is still running but has become stuck or unresponsive. If the probe fails failureThreshold times in a row, the kubelet kills the container and restarts it according to the pod's restartPolicy. This mechanism is a core self-healing capability in container orchestration, ensuring faulty instances are automatically recovered.
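
The success conditions of the three check types can be approximated in a few lines. The functions below are a hypothetical sketch of the conditions described above, not the kubelet's code: an HTTP status of at least 200 and below 400, a TCP connection that opens, and a command that exits with status 0.

```python
import socket
import subprocess


def http_status_ok(status_code: int) -> bool:
    """HTTP probe: any status from 200 up to, but not including, 400 passes."""
    return 200 <= status_code < 400


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """TCP probe: passes if a connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exec_check(command: list, timeout: float = 1.0) -> bool:
    """Exec probe: passes if the command exits with status 0 in time."""
    try:
        completed = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0
```

Each function returns a single boolean, mirroring the way every handler reduces to a pass or fail result that feeds the failure counter.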

Probes are defined in a container's specification within the pod manifest. Key parameters include initialDelaySeconds (wait time before starting probes), periodSeconds (time between probes), timeoutSeconds, successThreshold, and failureThreshold. Unlike a readiness probe, which controls traffic flow, a liveness probe governs container lifecycle. It is a foundational pattern for building resilient, fault-tolerant services, acting as an automated dead man's switch for containerized processes. Misconfiguration, such as overly sensitive checks, can cause unnecessary restart loops.

KUBERNETES

Liveness Probe Types: Comparison

A comparison of the three primary mechanisms for implementing a Kubernetes Liveness Probe, detailing their operation, configuration, and trade-offs.

| Probe Type | HTTP GET | TCP Socket | Exec Command |
| --- | --- | --- | --- |
| Core Mechanism | Issues an HTTP request to a specified endpoint | Attempts to open a TCP connection to a specified port | Executes a command inside the container |
| Success Condition | HTTP status code between 200 and 399 | TCP connection is successfully established | Command exits with status code 0 |
| Primary Use Case | Web servers, REST APIs, HTTP services | Non-HTTP services (e.g., databases, custom TCP protocols) | Custom, complex health logic not expressible via HTTP/TCP |
| Configuration Complexity | Low (requires endpoint path/port) | Low (requires port number) | High (requires crafting and securing a shell command) |
| Resource Overhead | Low (single HTTP request) | Very Low (port check) | Variable to High (depends on command; can be CPU/memory intensive) |
| Security Consideration | Endpoint should be internal/unprivileged | Port should be internal/firewalled | High risk; command runs with container privileges; avoid shell injection |
| Failure Granularity | Specific HTTP error code may be returned | Binary (connection succeeds or fails) | Custom exit code and stderr output available for debugging |
| Recommended Initial Delay | 5-30 seconds | 5-30 seconds | 30+ seconds (if command is resource-heavy) |

KUBERNETES HEALTH CHECKS

Frequently Asked Questions

A liveness probe is a core Kubernetes health check mechanism that determines if a container is running and responsive. This section answers common technical questions about its configuration, behavior, and role in resilient system design.

A liveness probe is a Kubernetes health check that determines if a container is running and responsive, triggering a restart if the probe fails. It is a diagnostic mechanism that periodically executes a test (an HTTP GET request, a TCP socket connection, or a command execution inside the container) to assess the application's basic operational state. Unlike a readiness probe, which gates traffic, a liveness probe's sole purpose is to identify and recover from a "dead" or unresponsive container by having the kubelet kill and restart that container; the Pod itself is not recreated. This automated recovery is a foundational pattern for building self-healing, resilient applications within a container orchestration platform.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.