Inferensys

Pod Disruption Budget (PDB)

A Pod Disruption Budget (PDB) is a Kubernetes API object that limits the number of pods of a replicated application that can be down simultaneously from voluntary disruptions, ensuring high availability during cluster operations.
KUBERNETES FAULT TOLERANCE

What is Pod Disruption Budget (PDB)?

A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods for a replicated application that must remain available during voluntary disruptions, ensuring application resilience.

A Pod Disruption Budget (PDB) is a declarative Kubernetes policy that constrains the number of pods in a replicated application that can be simultaneously unavailable due to voluntary disruptions. Voluntary disruptions are deliberate actions such as draining a node for maintenance, scaling down a deployment, or updating a pod template, but a PDB only limits those that go through the Eviction API, such as node drains and autoscaler scale-downs. Pods that are deleted directly, or replaced by a workload controller during a rollout, bypass the budget. The PDB defines its threshold with either the minAvailable or the maxUnavailable field, and the API server rejects any eviction that would breach it. It is a core component of fault-tolerant agent design, ensuring autonomous services maintain availability during planned infrastructure changes.

The disruption controller in the Kubernetes control plane continuously reconciles each PDB, counting the healthy pods its selector matches and publishing how many disruptions are currently allowed. A PDB does not protect against involuntary disruptions like hardware failures, which are handled by other mechanisms such as health probes and ReplicaSets. By guaranteeing a baseline of available replicas during evictions, a PDB enables graceful degradation and complements deployment strategies like canary deployments and rolling updates, which are paced by their own settings. This makes it essential for implementing self-healing software systems where high availability is a non-functional requirement.
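
As a minimal sketch of what such an object looks like, the following Python builds a policy/v1 PodDisruptionBudget manifest as a plain dictionary and prints it as JSON. The helper name build_pdb, the object name api-server-pdb, and the label app: api-server are illustrative assumptions, not values from any real cluster.

```python
import json


def build_pdb(name, match_labels, min_available=None, max_unavailable=None):
    """Return a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    # minAvailable and maxUnavailable are mutually exclusive: require exactly one.
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")

    spec = {"selector": {"matchLabels": dict(match_labels)}}
    if min_available is not None:
        spec["minAvailable"] = min_available
    else:
        spec["maxUnavailable"] = max_unavailable

    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical example values: keep at least 2 api-server pods available.
manifest = build_pdb("api-server-pdb", {"app": "api-server"}, min_available=2)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed document could be saved to a file and applied with kubectl apply -f.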

KUBERNETES FAULT TOLERANCE

Key Features of a Pod Disruption Budget

A Pod Disruption Budget (PDB) is a declarative Kubernetes API object that constrains voluntary disruptions to maintain application availability. It defines the minimum number or percentage of pods that must remain available during operations like node drains or cluster upgrades.

01

Voluntary vs. Involuntary Disruptions

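A PDB only governs voluntary disruptions, which are initiated by cluster administrators or automated systems with the intent of maintaining the cluster, and only when they are requested through the Eviction API. These include:

  • Node drain operations for maintenance or scaling.
  • Cluster autoscaler removing an underutilized node.
  • Any other eviction that an operator or controller requests through the Eviction API.

Deployment updates are also voluntary, but the controller deletes pods directly instead of evicting them, so a rollout is paced by the Deployment's own update strategy and not by the PDB.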

It does not protect against involuntary disruptions, such as:

  • Hardware failure of a node.
  • The Linux kernel Out-of-Memory (OOM) Killer terminating a pod.
  • A cloud provider deleting the underlying VM.

For involuntary failures, high availability is achieved through replica counts and other cluster-level resilience patterns.

02

MinAvailable and MaxUnavailable

A PDB specifies constraints using one of two mutually exclusive fields, defining the disruption budget in absolute numbers or percentages:

  • minAvailable: Specifies the minimum number (e.g., 2) or percentage (e.g., "50%") of pods from the controlled set that must remain available during a disruption. This is useful for ensuring a baseline service capacity.

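  • maxUnavailable: Specifies the maximum number (e.g., 1) or percentage (e.g., "25%") of pods from the controlled set that can be unavailable simultaneously. Because the limit is relative to the current replica count, it adapts automatically when the workload is scaled, which is why the Kubernetes documentation recommends it in most cases.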

For a deployment with 4 replicas, maxUnavailable: 1 ensures at least 3 pods remain available during voluntary disruptions, while minAvailable: "75%" achieves the same constraint. Percentages that do not map to a whole pod are rounded up in both cases.
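
The arithmetic can be sketched in a few lines of Python. This is a simplified model, assuming percentages are rounded up for both fields as the Kubernetes documentation describes; the function names are illustrative and not part of any Kubernetes client library.

```python
def resolve(value, expected_pods):
    """Turn an int or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str):
        if not value.endswith("%"):
            raise ValueError(f"expected a percentage like '50%', got {value!r}")
        percent = int(value[:-1])
        # Integer ceiling division: percentages are rounded up.
        return (percent * expected_pods + 99) // 100
    return int(value)


def desired_healthy(expected_pods, min_available=None, max_unavailable=None):
    """Minimum number of healthy pods the budget requires."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return resolve(min_available, expected_pods)
    return max(0, expected_pods - resolve(max_unavailable, expected_pods))


def disruptions_allowed(current_healthy, desired):
    """How many pods may be evicted right now without breaching the budget."""
    return max(0, current_healthy - desired)


print(desired_healthy(4, max_unavailable=1))    # 3
print(desired_healthy(4, min_available="75%"))  # 3
print(desired_healthy(7, min_available="50%"))  # 4 (3.5 rounded up)
print(disruptions_allowed(current_healthy=4, desired=3))  # 1
```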

03

Selector-Based Pod Targeting

A PDB does not reference pods directly. Instead, it uses a label selector (spec.selector) to dynamically match a set of pods. This selector must match the labels on the pods managed by a Deployment, StatefulSet, or other workload controller.

Example: A PDB with selector.matchLabels: app=api-server will apply to all pods with the label app: api-server. This decouples the availability policy from the specific pod instances, allowing the PDB to work seamlessly as pods are created and deleted by the workload controller during normal operations and scaling events.
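
The matching rule for matchLabels can be illustrated with a short Python sketch. The pod names and the extra pod-template-hash values below are made up for the example; the logic only models equality-based matchLabels, not matchExpressions.

```python
def matches(match_labels, pod_labels):
    """True if every key/value pair in the selector is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


selector = {"app": "api-server"}

# Hypothetical pods; extra labels on a pod do not prevent a match.
pods = [
    {"name": "api-server-7d9f-abcde", "labels": {"app": "api-server", "pod-template-hash": "7d9f"}},
    {"name": "api-server-7d9f-fghij", "labels": {"app": "api-server", "pod-template-hash": "7d9f"}},
    {"name": "worker-55c4-klmno", "labels": {"app": "worker", "pod-template-hash": "55c4"}},
]

selected = [pod["name"] for pod in pods if matches(selector, pod["labels"])]
print(selected)
```

Since the budget is evaluated against whatever pods carry the label at that moment, replacing a pod with a new one that has the same labels requires no change to the PDB.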

04

Integration with Cluster Operations

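The Kubernetes cluster autoscaler and the kubectl drain command both evict pods through the Eviction API, so they respect PDBs. When a user requests a node drain:

  1. kubectl cordons the node and requests an eviction for each pod on it; the API server checks each request against the PDB that selects the pod.
  2. If the budget allows it, the pod is terminated gracefully, respecting its terminationGracePeriodSeconds.
  3. If the eviction would violate the budget, the API server rejects it and kubectl drain keeps retrying until the budget allows the eviction or the command's --timeout expires. A PodDisruptionBudget with maxUnavailable: 0 blocks every eviction and so creates an intentional operational gate that requires manual intervention to proceed. The --disable-eviction flag does the opposite: it makes drain delete pods directly, bypassing PDB checks.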
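
The per-pod budget check during a drain can be modeled with a small simulation. This is a sketch under simplifying assumptions: each pod is covered by at most one budget, keyed here by a hypothetical app name, and one pass over the node is shown, whereas a real drain retries blocked pods as replacements become Ready elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    name: str
    disruptions_allowed: int


def try_evict(pod, budgets_by_app):
    """Model one eviction request: allow it only if the budget has headroom."""
    budget = budgets_by_app.get(pod["app"])
    if budget is None:
        return True  # no PDB selects this pod, so nothing blocks the eviction
    if budget.disruptions_allowed > 0:
        budget.disruptions_allowed -= 1  # the evicted pod consumes budget
        return True
    return False  # the real Eviction API rejects this with 429 Too Many Requests


def drain_pass(node_pods, budgets_by_app):
    """Attempt to evict every pod on a node once; report what was blocked."""
    evicted, blocked = [], []
    for pod in node_pods:
        (evicted if try_evict(pod, budgets_by_app) else blocked).append(pod["name"])
    return evicted, blocked


# Hypothetical node hosting two api pods and one cache pod without a PDB.
node_pods = [
    {"name": "api-1", "app": "api"},
    {"name": "api-2", "app": "api"},
    {"name": "cache-1", "app": "cache"},
]
budgets = {"api": Budget(name="api-pdb", disruptions_allowed=1)}

evicted, blocked = drain_pass(node_pods, budgets)
print("evicted:", evicted)
print("blocked:", blocked)
```

In this run the second api pod stays blocked until a replacement for the first becomes Ready and the budget regains headroom.
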
05

Health and Status Conditions

A PDB's status field provides real-time observability into its enforcement:

  • currentHealthy: The number of currently observed healthy pods matching the selector.
  • desiredHealthy: The minimum number of pods required to be healthy, derived from minAvailable or maxUnavailable.
  • disruptionsAllowed: The number of pods that can currently be disrupted without violating the budget. This is the key operational metric.
  • expectedPods: The total number of pods matched by the selector.

Monitoring disruptionsAllowed dropping to 0 is critical, as it signals that voluntary disruptions are blocked, potentially halting automated cluster operations.
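
A simple check for blocked budgets might look like the following sketch. It assumes the input has the shape of kubectl get pdb -o json output (a list object with an items array); the sample data, names, and namespace are invented for illustration.

```python
import json


def blocked_budgets(pdb_list):
    """Return namespace/name for every PDB that currently allows no disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        # A missing status is treated as "no disruptions allowed" to stay conservative.
        if status.get("disruptionsAllowed", 0) == 0:
            meta = item["metadata"]
            blocked.append(f'{meta.get("namespace", "default")}/{meta["name"]}')
    return blocked


# Hypothetical sample in the shape of `kubectl get pdb -o json`.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "api-server-pdb", "namespace": "prod"},
      "status": {"currentHealthy": 3, "desiredHealthy": 3,
                 "disruptionsAllowed": 0, "expectedPods": 4}
    },
    {
      "metadata": {"name": "worker-pdb", "namespace": "prod"},
      "status": {"currentHealthy": 5, "desiredHealthy": 4,
                 "disruptionsAllowed": 1, "expectedPods": 5}
    }
  ]
}
""")

print(blocked_budgets(sample))
```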

06

Strategic Use with Other Patterns

PDBs are a core component of a layered resilience strategy and interact with several sibling fault-tolerance concepts:

  • Circuit Breaker Pattern: While a circuit breaker protects a service from downstream failures, a PDB protects the service's own instances from being removed. They are complementary controls.
  • Graceful Degradation: A PDB enforcing minAvailable: "50%" ensures the service degrades gracefully under operational pressure, maintaining partial capacity.
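  • Canary Deployment: During a canary rollout, a PDB on the stable deployment keeps node drains and other evictions from removing too many stable pods at once, so the stable version retains the capacity to serve the traffic that is not routed to the canary. The traffic split itself is controlled by the rollout tooling, not by the PDB.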
  • Health Probes: PDBs consider a pod "healthy" if it is in the Ready condition, which is determined by the pod's readiness probe. Properly configured probes are essential for PDB accuracy.
SELF-HEALING SOFTWARE SYSTEMS

How Pod Disruption Budgets Work

A Pod Disruption Budget (PDB) is a Kubernetes API object that limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions, ensuring high availability.

A Pod Disruption Budget (PDB) is a declarative Kubernetes policy that safeguards application availability during voluntary disruptions like node drains or cluster upgrades. It defines the minimum number of available pods (minAvailable) or the maximum number of unavailable pods (maxUnavailable) a deployment or stateful set can tolerate. The API server enforces this policy through the Eviction API, rejecting voluntary evictions that would violate the budget and cause an application-level outage. This mechanism is a core fault-tolerant agent design pattern for autonomous, self-healing systems.

PDBs operate within the reconciliation loop of the Kubernetes control plane, which continuously observes pod states. They only govern voluntary disruptions initiated by cluster administrators; they do not protect against involuntary disruptions like hardware failure. For comprehensive resilience, PDBs are used alongside patterns like circuit breakers and health probes. This creates a layered defense, allowing the system to manage planned changes gracefully while other components handle unexpected failures, embodying the principles of graceful degradation and recursive error correction.

FAULT TOLERANCE MECHANISMS

PDB vs. Other Availability Controls

A comparison of the Pod Disruption Budget with other common patterns and controls used to ensure application availability and resilience in Kubernetes and distributed systems.

| Feature / Mechanism | Pod Disruption Budget (PDB) | Health Probes | Circuit Breaker Pattern |
| --- | --- | --- | --- |
| Primary Purpose | Govern voluntary disruptions (e.g., node drains, autoscaler scale-downs) | Detect and remediate unhealthy application instances (pods) | Prevent cascading failures from downstream service outages |
| Control Scope | Application-level (across pod replicas) | Pod/container-level | Service-to-service communication level |
| Trigger Condition | Eviction request through the Eviction API (voluntary disruption) | Container fails liveness/readiness check (involuntary failure) | Downstream service failures exceed a defined threshold |
| Automatic Action | Blocks eviction if it would violate budget | Restarts container (liveness) or removes from service endpoints (readiness) | Fails fast or uses fallback logic; stops sending requests |
| Configuration Method | Kubernetes API object (YAML manifest) | Pod spec fields (livenessProbe, readinessProbe) | Library configuration in the service client (e.g., resilience4j) or service mesh policy (e.g., Istio) |
| Operates During | Planned maintenance operations | Continuous runtime operation | Runtime operation, during dependent service failure |
| Key Metric Enforced | minAvailable or maxUnavailable pod count | Probe success/failure rate and timing | Failure rate, slow call rate, request volume threshold |
| Integration with Orchestrator | Native to the Kubernetes Eviction API and disruption controller | Native to the kubelet and Service endpoint management | Implemented in application code or service mesh (e.g., Istio, Linkerd) |

SELF-HEALING SOFTWARE SYSTEMS

Common PDB Use Cases & Examples

A Pod Disruption Budget (PDB) is a critical Kubernetes construct for managing voluntary disruptions. It ensures application availability by defining the minimum number of healthy pods that must remain available during operations like node maintenance or cluster upgrades.

01

Safeguarding Critical Stateful Workloads

PDBs are essential for stateful applications like databases (e.g., PostgreSQL, Cassandra) and message queues (e.g., Kafka). These systems often have complex replication and leader-follower topologies where losing multiple pods simultaneously can cause data loss or extended unavailability.

  • Example: A 3-pod Kafka cluster with a PDB of maxUnavailable: 1. This ensures the cluster maintains a quorum (2 out of 3 brokers) during voluntary disruptions, preventing a complete loss of message production or consumption.
  • Key Consideration: The PDB minAvailable or maxUnavailable value must align with the application's own replication factor and quorum requirements.
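
The relationship between member count and a quorum-safe budget can be sketched as simple majority arithmetic. This is a generic model, assuming the application needs a strict majority of members; for Kafka specifically, the safe value also depends on topic replication factor and min.insync.replicas, which this sketch does not model.

```python
def majority_quorum(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def max_unavailable_for_quorum(members):
    """Largest maxUnavailable that still leaves a majority in place."""
    return members - majority_quorum(members)


for members in (3, 5, 7):
    print(
        f"{members} members: quorum={majority_quorum(members)}, "
        f"maxUnavailable<={max_unavailable_for_quorum(members)}"
    )
```
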
02

Controlled Node Drain & Cluster Maintenance

During planned node maintenance (e.g., kernel updates, hardware refresh), kubectl drain relies on PDBs to empty nodes safely. The command first cordons the node so no new pods are scheduled onto it, then evicts the existing pods through the Eviction API, waiting and retrying whenever an eviction would violate a budget.

  • Process: 1) Administrator initiates kubectl drain <node-name>. 2) The API server checks PDBs for pods on the node. 3) If eviction violates a PDB, the drain is blocked until the condition is resolved (e.g., by manually scaling up the deployment).
  • Result: This allows for zero-downtime maintenance windows for applications that define appropriate PDBs, transforming a risky operation into a predictable, automated procedure.
03

Coordinating Rolling Updates & Deployments

PDBs work alongside deployment strategies, but they do not pace the rollout itself. During a rolling update of a Deployment, the controller creates new pods and deletes old ones directly, governed by the Deployment's strategy.rollingUpdate.maxUnavailable and maxSurge settings rather than by any PDB. Pods that are not Ready during the rollout still count against the budget, so the PDB keeps a concurrent node drain from pushing availability even lower.

  • Scenario: A Deployment with 10 replicas, rollingUpdate.maxUnavailable: 2, and a PDB of minAvailable: 8. The rollout itself never drops below 8 ready pods. While 2 pods are unavailable, the PDB reports disruptionsAllowed: 0, so a node drain started mid-rollout waits instead of evicting a third pod.
  • Integration: If the new pods fail their readiness checks, the rollout stalls; it does not roll back automatically. The remaining old pods keep serving, and the PDB continues to block evictions until enough pods are Ready again.
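
The 8-of-10 floor in this scenario can also be derived from the Deployment's own strategy.rollingUpdate settings, which are what pace a rollout. The sketch below assumes the documented rounding (maxUnavailable percentages round down, maxSurge percentages round up) and the default of 25% for both; it ignores the special case where both values resolve to zero, and the function name is illustrative.

```python
def rolling_update_bounds(replicas, max_unavailable="25%", max_surge="25%"):
    """Pod-count bounds a Deployment rolling update stays within."""

    def resolve(value, round_up):
        if isinstance(value, str):
            scaled = int(value.rstrip("%")) * replicas
            # Integer math: ceiling for surge, floor for unavailable.
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    unavailable = resolve(max_unavailable, round_up=False)
    surge = resolve(max_surge, round_up=True)
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


print(rolling_update_bounds(10))                     # defaults: 25% / 25%
print(rolling_update_bounds(10, max_unavailable=2))  # absolute value
```

With 10 replicas and the defaults, at most 2 pods are unavailable at a time, which matches a PDB of minAvailable: 8 and leaves no spare budget for evictions while the rollout is at its limit.
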
04

Enforcing High Availability for Microservices

For stateless microservices, a PDB defines the service's availability SLA to the cluster. It acts as a declarative guardrail against excessive pod churn, ensuring a baseline level of capacity for handling incoming requests.

  • Example: A frontend API service with 5 replicas might have a PDB of maxUnavailable: "20%". This keeps at least 4 pods available during voluntary disruptions, allowing the cluster autoscaler or a node drain to evict only one pod at a time.
  • Benefit: This prevents a drain or scale-down from evicting several replicas at once, which could otherwise cascade latency spikes or errors through the dependency graph. It does not apply to node-pressure evictions performed by the kubelet, which ignore PDBs.
05

Interaction with Cluster Autoscaler

The Cluster Autoscaler uses PDBs when deciding to scale down a node. A node is only considered for removal if all pods running on it can be safely moved elsewhere without violating any PDB.

  • Cost Optimization: This allows for aggressive cluster scaling policies while protecting application availability. Without PDBs, the autoscaler might remove a node hosting multiple pods from the same application, causing an unintended outage.
  • Constraint: Pods with restrictive PDBs (e.g., minAvailable set to a high percentage of total replicas) can block scale-down operations, potentially increasing cloud costs. The PDB value is a direct trade-off between resilience and resource efficiency.
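
A simplified version of the scale-down check can be sketched as follows. This is a toy model, not the autoscaler's actual algorithm: it only compares the number of pods per budget on a node with that budget's disruptionsAllowed, and it ignores whether the pods can be rescheduled elsewhere. All names and values are hypothetical.

```python
from collections import Counter


def node_removable(node_pods, disruptions_allowed):
    """Check whether every PDB-covered pod on a node fits within its budget.

    Returns (removable, blockers), where blockers maps a PDB name to the
    number of evictions the node would need from that budget.
    """
    needed = Counter(pod["pdb"] for pod in node_pods if pod.get("pdb"))
    blockers = {
        name: count
        for name, count in needed.items()
        if count > disruptions_allowed.get(name, 0)
    }
    return (not blockers, blockers)


# Hypothetical node hosting two replicas of the same application.
node_pods = [
    {"name": "api-1", "pdb": "api-pdb"},
    {"name": "api-2", "pdb": "api-pdb"},
    {"name": "batch-1", "pdb": None},  # not covered by any PDB
]

print(node_removable(node_pods, {"api-pdb": 1}))
print(node_removable(node_pods, {"api-pdb": 2}))
```

Spreading replicas across nodes lowers the number of evictions any single node needs from one budget, which makes scale-down less likely to be blocked.
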
06

Limitations & Voluntary vs. Involuntary Disruptions

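It is critical to understand that a PDB only governs voluntary disruptions that are requested through the Eviction API. Voluntary disruptions are those initiated deliberately by a human operator or an automated controller. A PDB provides no protection against involuntary disruptions.

  • Voluntary Disruptions: Node drain and eviction by the cluster autoscaler are limited by the PDB. Deletion of a pod by a Deployment controller, or directly with kubectl delete, is also voluntary but bypasses the Eviction API and is not limited by the PDB.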
  • Involuntary Disruptions: Hardware failure, kernel panic, node running out of resources (triggering the OOM Killer), or a provider deleting the underlying VM.
  • Implication: A PDB is not a substitute for overall high-availability design. It must be complemented by:
    • Adequate pod replicas spread across failure domains (using Pod Anti-Affinity).
    • Robust health probes (liveness/readiness).
    • Resource requests and limits to prevent OOM kills.
POD DISRUPTION BUDGET

Frequently Asked Questions

A Pod Disruption Budget (PDB) is a critical Kubernetes API object for ensuring application availability during voluntary disruptions. These FAQs address its core mechanics, configuration, and role within self-healing, fault-tolerant architectures.

A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods for a given application that must remain available during voluntary disruptions. It works by acting as a constraint on the Kubernetes cluster's management operations, such as node drains for maintenance or cluster autoscaler scale-down events. When an operation requests a pod eviction through the Eviction API, the API server evaluates the request against the PDB that selects the pod, using the status that the disruption controller keeps up to date. If the eviction would violate the PDB's minAvailable or maxUnavailable rule, it is rejected with a 429 Too Many Requests response and the caller can retry later, protecting the application's availability. This mechanism is a declarative form of fault isolation, ensuring that self-healing systems maintain their service level objectives (SLOs) even during planned administrative actions.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.