A Pod Disruption Budget (PDB) is a declarative Kubernetes policy that constrains the number of pods in a replicated application that can be simultaneously unavailable due to voluntary disruptions. These disruptions include user-initiated actions such as draining a node for maintenance or upgrading a cluster. The PDB defines its threshold using the minAvailable or maxUnavailable field, which the Eviction API enforces to prevent excessive downtime. It is a core component of fault-tolerant agent design, ensuring autonomous services maintain availability during planned infrastructure changes.
Pod Disruption Budget (PDB)

What is a Pod Disruption Budget (PDB)?
A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods for a replicated application that must remain available during voluntary disruptions, ensuring application resilience.
The PDB operates within the reconciliation loop of the Kubernetes control plane: the disruption controller keeps each PDB's status current, and the Eviction API consults that status before it evicts a pod. A PDB does not protect against involuntary disruptions like hardware failures, which are handled by other mechanisms such as health probes and ReplicaSets. By guaranteeing a baseline of available replicas during planned operations, a PDB supports graceful degradation and keeps node maintenance from cutting into capacity while a canary deployment or rolling update is in progress. This makes it essential for implementing self-healing software systems where high availability is a non-functional requirement.
Key Features of a Pod Disruption Budget
A Pod Disruption Budget (PDB) is a declarative Kubernetes API object that constrains voluntary disruptions to maintain application availability. It defines the minimum number or percentage of pods that must remain available during operations like node drains or cluster upgrades.
Voluntary vs. Involuntary Disruptions
A PDB only governs voluntary disruptions, which are initiated by cluster administrators or automated systems with the intent of maintaining the cluster. These include:
- Node drain operations for maintenance or scaling.
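- Any other tooling that removes pods through the Eviction API. Deployment rolling updates delete pods directly, so they are not limited by a PDB, although the pods they take down still count against the budget.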
- Cluster autoscaler removing an underutilized node.
It does not protect against involuntary disruptions, such as:
- Hardware failure of a node.
- The Linux kernel Out-of-Memory (OOM) Killer terminating a pod.
- A cloud provider deleting the underlying VM.
For involuntary failures, high availability is achieved through replica counts and other cluster-level resilience patterns.
MinAvailable and MaxUnavailable
A PDB specifies its constraint using one of two mutually exclusive fields, each of which accepts an absolute number or a percentage:
- minAvailable: Specifies the minimum number (e.g., 2) or percentage (e.g., "50%") of pods from the controlled set that must remain available during a disruption. This is useful for ensuring a baseline service capacity.
- maxUnavailable: Specifies the maximum number (e.g., 1) or percentage (e.g., "25%") of pods from the controlled set that can be unavailable simultaneously. This is often the better fit when the replica count changes, because the budget tracks the current number of replicas.
For a Deployment with 4 replicas, maxUnavailable: 1 ensures that voluntary evictions never leave fewer than 3 pods available, and minAvailable: "75%" expresses the same constraint. Percentages are rounded up to a whole number of pods.
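The arithmetic behind the two fields can be sketched in a few lines of Python. This is an illustrative model rather than Kubernetes source code: the function names scaled_value and desired_healthy are hypothetical, and the sketch assumes percentages are rounded up to a whole pod.

```python
def scaled_value(value, total_pods):
    """Resolve an int, or a percentage string such as "75%", to a pod count.

    Assumption: percentages round up to the next whole pod.
    """
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (percent * total_pods + 99) // 100  # integer ceiling
    return int(value)


def desired_healthy(expected_pods, min_available=None, max_unavailable=None):
    """Return how many pods must stay healthy under the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return scaled_value(min_available, expected_pods)
    return max(expected_pods - scaled_value(max_unavailable, expected_pods), 0)


# Both budgets from the 4-replica example resolve to the same floor.
print(desired_healthy(4, max_unavailable=1))
print(desired_healthy(4, min_available="75%"))
```

Both calls print 3, matching the 4-replica example.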
Selector-Based Pod Targeting
A PDB does not reference pods directly. Instead, it uses a label selector (spec.selector) to dynamically match a set of pods. This selector must match the labels on the pods managed by a Deployment, StatefulSet, or other workload controller.
Example: A PDB with selector.matchLabels: app=api-server will apply to all pods with the label app: api-server. This decouples the availability policy from the specific pod instances, allowing the PDB to work seamlessly as pods are created and deleted by the workload controller during normal operations and scaling events.
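A minimal sketch of matchLabels-style selection, to show why the policy follows pods as they come and go. The pod names and the pod-template-hash values are made up for illustration, and the matches helper is hypothetical rather than part of any Kubernetes client library.

```python
def matches(match_labels, pod_labels):
    """A pod matches when every selector label is present with the same value."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical pods; extra labels on a pod do not prevent a match.
pods = [
    {"name": "api-server-1", "labels": {"app": "api-server", "pod-template-hash": "abc"}},
    {"name": "api-server-2", "labels": {"app": "api-server", "pod-template-hash": "def"}},
    {"name": "worker-1", "labels": {"app": "worker"}},
]

selector = {"app": "api-server"}
covered = [pod["name"] for pod in pods if matches(selector, pod["labels"])]
print(covered)
```

Because the match is recomputed from labels, a replacement pod created by the workload controller is covered as soon as it carries app: api-server.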
Integration with Cluster Operations
The Kubernetes cluster autoscaler and the kubectl drain command explicitly respect PDBs. When a user requests a node drain, the system:
- Cordons the node and asks the Eviction API to evict each pod; the API checks whether the eviction would violate a matching PDB.
- If allowed, it evicts pods gracefully, respecting the pod's terminationGracePeriodSeconds.
- If a violation would occur, the eviction is rejected and the drain command waits and retries. It fails only if a --timeout is set and expires.
A PodDisruptionBudget with maxUnavailable: 0 can create an intentional operational gate, requiring manual intervention to proceed. The --disable-eviction flag does the opposite: it makes kubectl drain delete pods directly, which bypasses PDB checks.
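The drain behavior can be modeled as one pass over a node's pods with a per-PDB counter. This is a simplified sketch, not kubectl's implementation: the drain function, pod names, and the api-pdb budget are hypothetical, and a real drain would keep retrying blocked pods instead of stopping after one pass.

```python
def drain(node_pods, disruptions_allowed):
    """Attempt to evict each pod once.

    node_pods: list of (pod_name, pdb_name or None) tuples.
    disruptions_allowed: dict of pdb_name -> remaining budget (mutated in place).
    Returns (evicted, blocked) lists of pod names.
    """
    evicted, blocked = [], []
    for pod, pdb in node_pods:
        if pdb is None:
            evicted.append(pod)  # no budget covers this pod
        elif disruptions_allowed[pdb] > 0:
            disruptions_allowed[pdb] -= 1  # eviction consumes budget
            evicted.append(pod)
        else:
            blocked.append(pod)  # a real drain would wait and retry
    return evicted, blocked


budget = {"api-pdb": 1}
node = [("api-1", "api-pdb"), ("api-2", "api-pdb"), ("batch-1", None)]
evicted, blocked = drain(node, budget)
print(evicted, blocked)
```

Here api-2 stays blocked until a replacement for api-1 becomes Ready elsewhere and the budget recovers.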
Health and Status Conditions
A PDB's status field provides real-time observability into its enforcement:
- currentHealthy: The number of currently observed healthy pods matching the selector.
- desiredHealthy: The minimum number of pods required to be healthy, derived from minAvailable or maxUnavailable.
- disruptionsAllowed: The number of pods that can currently be disrupted without violating the budget. This is the key operational metric.
- expectedPods: The total number of pods matched by the selector.
Monitoring disruptionsAllowed dropping to 0 is critical, as it signals that voluntary disruptions are blocked, potentially halting automated cluster operations.
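How the four status fields relate can be sketched as follows. This is a simplified model that assumes an integer budget and treats a pod as healthy when it is Ready; pdb_status is a hypothetical helper, not a Kubernetes API.

```python
def pdb_status(pods_ready, min_available=None, max_unavailable=None):
    """Derive PDB-style status fields from a list of per-pod Ready flags.

    Simplification: only integer budgets are handled here.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    expected = len(pods_ready)
    current = sum(1 for ready in pods_ready if ready)
    if min_available is not None:
        desired = min_available
    else:
        desired = max(expected - max_unavailable, 0)
    return {
        "expectedPods": expected,
        "currentHealthy": current,
        "desiredHealthy": desired,
        "disruptionsAllowed": max(current - desired, 0),
    }


# Five pods, one not Ready: the single allowed disruption is already used up.
status = pdb_status([True, True, True, True, False], min_available=4)
print(status)
```

With one pod already not Ready, disruptionsAllowed is 0, which is exactly the condition worth alerting on.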
Strategic Use with Other Patterns
PDBs are a core component of a layered resilience strategy and interact with several sibling fault-tolerance concepts:
- Circuit Breaker Pattern: While a circuit breaker protects a service from downstream failures, a PDB protects the service's own instances from being removed. They are complementary controls.
- Graceful Degradation: A PDB enforcing minAvailable: "50%" ensures the service degrades gracefully under operational pressure, maintaining partial capacity.
- Canary Deployment: During a canary rollout, a PDB on the stable deployment keeps node drains and other evictions from removing too many stable pods at once. The share of traffic the canary receives is set by the rollout tooling, not by the PDB.
- Health Probes: PDBs consider a pod "healthy" if it is in the Ready condition, which is determined by the pod's readiness probe. Properly configured probes are essential for PDB accuracy.
How Pod Disruption Budgets Work
A Pod Disruption Budget (PDB) is a Kubernetes API object that limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions, ensuring high availability.
A Pod Disruption Budget (PDB) is a declarative Kubernetes policy that safeguards application availability during voluntary disruptions like node drains or cluster upgrades. It defines the minimum number of available pods (minAvailable) or the maximum number of unavailable pods (maxUnavailable) a deployment or stateful set can tolerate. The Eviction API enforces this policy, rejecting voluntary evictions that would violate the budget and cause an application-level outage. This mechanism is a core fault-tolerant agent design pattern for autonomous, self-healing systems.
PDBs operate within the reconciliation loop of the Kubernetes control plane, which continuously observes pod states. They only govern voluntary disruptions initiated by cluster administrators; they do not protect against involuntary disruptions like hardware failure. For comprehensive resilience, PDBs are used alongside patterns like circuit breakers and health probes. This creates a layered defense, allowing the system to manage planned changes gracefully while other components handle unexpected failures, embodying the principles of graceful degradation and recursive error correction.
PDB vs. Other Availability Controls
A comparison of the Pod Disruption Budget with other common patterns and controls used to ensure application availability and resilience in Kubernetes and distributed systems.
| Feature / Mechanism | Pod Disruption Budget (PDB) | Health Probes | Circuit Breaker Pattern |
|---|---|---|---|
| Primary Purpose | Govern voluntary disruptions (e.g., node drains, updates) | Detect and remediate unhealthy application instances (pods) | Prevent cascading failures from downstream service outages |
| Control Scope | Application-level (across pod replicas) | Pod/container-level | Service-to-service communication level |
| Trigger Condition | User-initiated eviction (voluntary disruption) | Container fails liveness/readiness check (involuntary failure) | Downstream service failures exceed a defined threshold |
| Automatic Action | Blocks eviction if it would violate budget | Restarts container (liveness) or removes from service endpoints (readiness) | Fails fast or uses fallback logic; stops sending requests |
| Configuration Method | Kubernetes API object (YAML manifest) | Pod spec fields (livenessProbe, readinessProbe) | Code/library configuration in service client (e.g., resilience4j, Istio) |
| Operates During | Planned maintenance operations | Continuous runtime operation | Runtime operation, during dependent service failure |
| Key Metric Enforced | minAvailable or maxUnavailable | Probe success/failure rate and timing | Failure rate, slow call rate, request volume threshold |
| Integration with Orchestrator | Native to the Kubernetes Eviction API and disruption controller | Native to kubelet and service controller | Implemented in application code or service mesh (e.g., Istio, Linkerd) |
Common PDB Use Cases & Examples
A Pod Disruption Budget (PDB) is a critical Kubernetes construct for managing voluntary disruptions. It ensures application availability by defining the minimum number of healthy pods that must remain available during operations like node maintenance or cluster upgrades.
Safeguarding Critical Stateful Workloads
PDBs are essential for stateful applications like databases (e.g., PostgreSQL, Cassandra) and message queues (e.g., Kafka). These systems often have complex replication and leader-follower topologies where losing multiple pods simultaneously can cause data loss or extended unavailability.
- Example: A 3-pod Kafka cluster with a PDB of maxUnavailable: 1. This ensures the cluster maintains a quorum (2 out of 3 brokers) during voluntary disruptions, preventing a complete loss of message production or consumption.
- Key Consideration: The PDB minAvailable or maxUnavailable value must align with the application's own replication factor and quorum requirements.
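Checking a budget against a quorum requirement is simple arithmetic. The sketch below assumes a plain majority quorum (more than half of the members); real systems may define availability differently, and the function names are hypothetical.

```python
def quorum_size(members):
    """Smallest majority of a cluster with `members` nodes (assumed quorum rule)."""
    return members // 2 + 1


def max_safe_unavailable(members):
    """Largest number of members that can be down while a majority remains."""
    return members - quorum_size(members)


def pdb_preserves_quorum(members, max_unavailable):
    """True when the budget never lets voluntary evictions break the majority."""
    return max_unavailable <= max_safe_unavailable(members)


print(quorum_size(3), max_safe_unavailable(3))
print(pdb_preserves_quorum(3, 1), pdb_preserves_quorum(3, 2))
```

For the 3-member example, a budget of maxUnavailable: 1 preserves the majority, while maxUnavailable: 2 would not.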
Controlled Node Drain & Cluster Maintenance
During planned node maintenance (e.g., kernel updates, hardware refresh), kubectl drain relies on PDBs to empty nodes safely. The command cordons the node and then evicts its pods through the Eviction API, waiting whenever an eviction would violate a budget.
- Process: 1) Administrator initiates kubectl drain <node-name>. 2) The API server checks PDBs for pods on the node. 3) If eviction violates a PDB, the drain is blocked until the condition is resolved (e.g., when replacement pods become Ready on other nodes, or by manually scaling up the deployment).
- Result: Applications that define appropriate PDBs keep their minimum capacity throughout the maintenance window, transforming a risky operation into a predictable, automated procedure.
Coordinating Rolling Updates & Deployments
PDBs and rollout strategies are separate controls that together prevent self-inflicted outages. During a rolling update of a Deployment, Kubernetes creates new pods and terminates old ones. The pace of that rollout is set by the Deployment's own strategy.rollingUpdate settings (maxUnavailable and maxSurge); the Deployment controller is not limited by a PDB.
- Scenario: A Deployment with 10 replicas and a PDB of minAvailable: 8. Pods that are unavailable because of the rollout still count against the budget, so while the rollout has 2 pods out of service, disruptionsAllowed is 0 and a concurrent node drain waits instead of removing a third pod.
- Integration: If the new pods fail their readiness checks, the rollout stalls according to the Deployment's settings, and the PDB keeps evictions from further reducing the pods that are still serving traffic.
Enforcing High Availability for Microservices
For stateless microservices, a PDB defines the service's availability SLA to the cluster. It acts as a declarative guardrail against excessive pod churn, ensuring a baseline level of capacity for handling incoming requests.
- Example: A frontend API service with 5 replicas might have a PDB of maxUnavailable: 20%. This keeps at least 4 pods available during voluntary disruptions, allowing the cluster autoscaler or a node drain to evict only one pod at a time.
- Benefit: This prevents a drain or scale-down from evicting several replicas at the same moment, which could cascade latency spikes or errors through the dependency graph.
Interaction with Cluster Autoscaler
The Cluster Autoscaler uses PDBs when deciding to scale down a node. A node is only considered for removal if all pods running on it can be safely moved elsewhere without violating any PDB.
- Cost Optimization: This allows for aggressive cluster scaling policies while protecting application availability. Without PDBs, the autoscaler might remove a node hosting multiple pods from the same application, causing an unintended outage.
- Constraint: Pods with restrictive PDBs (e.g., minAvailable set to a high percentage of total replicas) can block scale-down operations, potentially increasing cloud costs. The PDB value is a direct trade-off between resilience and resource efficiency.
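The scale-down decision can be approximated by asking whether every budget has enough headroom for all of its pods on the candidate node. This is a simplified model for illustration and not the Cluster Autoscaler's actual algorithm; node_removable, the pod names, and api-pdb are hypothetical.

```python
def node_removable(node_pods, disruptions_allowed):
    """Return True when all pods on the node could be evicted within budget.

    node_pods: list of (pod_name, pdb_name or None) tuples.
    disruptions_allowed: dict of pdb_name -> remaining budget (not mutated).
    """
    needed = {}
    for _pod, pdb in node_pods:
        if pdb is not None:
            needed[pdb] = needed.get(pdb, 0) + 1
    return all(disruptions_allowed.get(pdb, 0) >= count
               for pdb, count in needed.items())


budget = {"api-pdb": 1}
node_a = [("api-1", "api-pdb"), ("api-2", "api-pdb")]  # two replicas of one app
node_b = [("api-3", "api-pdb"), ("batch-1", None)]     # one replica plus an unprotected pod
print(node_removable(node_a, budget), node_removable(node_b, budget))
```

In this model node_a is kept because removing it would take down two pods against a budget of one, while node_b can be scaled down.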
Limitations & Voluntary vs. Involuntary Disruptions
It is critical to understand that a PDB only governs voluntary disruptions that go through the Eviction API. These are initiated by cluster operators or by automation acting on their behalf. A PDB provides no protection against involuntary disruptions.
- Voluntary Disruptions: Node drain, eviction by cluster autoscaler. Pods deleted directly (for example, by a Deployment controller during a rollout) are also voluntary, but direct deletion bypasses the PDB.
- Involuntary Disruptions: Hardware failure, kernel panic, node running out of resources (triggering the OOM Killer), or a provider deleting the underlying VM.
- Implication: A PDB is not a substitute for overall high-availability design. It must be complemented by:
- Adequate pod replicas spread across failure domains (using Pod Anti-Affinity).
- Robust health probes (liveness/readiness).
- Resource requests and limits to prevent OOM kills.
Frequently Asked Questions
A Pod Disruption Budget (PDB) is a critical Kubernetes API object for ensuring application availability during voluntary disruptions. These FAQs address its core mechanics, configuration, and role within self-healing, fault-tolerant architectures.
A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods for a given application that must remain available during voluntary disruptions. It works by acting as a constraint on the Kubernetes cluster's management operations, such as node drains for maintenance or cluster autoscaler scale-down events. When such an operation asks to evict a pod, the Eviction API evaluates the request against the applicable PDB, whose status the disruption controller keeps up to date. If evicting the pod would violate the PDB's minAvailable or maxUnavailable rule, the eviction is rejected and the caller retries later, protecting the application's availability. This mechanism is a declarative form of fault isolation, ensuring that self-healing systems maintain their service level objectives (SLOs) even during planned administrative actions.
Related Terms
A Pod Disruption Budget operates within a broader ecosystem of patterns and tools designed for resilient, autonomous systems. These related concepts are fundamental for architects building fault-tolerant platforms.
Reconciliation Loop
The core control mechanism in declarative systems like Kubernetes. It continuously observes the actual state of the system, compares it to the declared desired state (e.g., a Deployment spec), and executes actions to converge the two.
- Observe: Scan the current state of pods, nodes, etc.
- Diff: Compare against the desired state in the API server.
- Act: Create, delete, or update objects to eliminate the difference.
The PDB is a constraint evaluated within this loop. The disruption controller recomputes each PDB's status as pods change, and the Eviction API consults that status, honoring the maxUnavailable or minAvailable rule, before it permits a pod to be evicted.
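The observe, diff, act cycle can be sketched for a replica count. This is a toy model, not controller source code: reconcile_once, the web- pod names, and the in-memory set standing in for cluster state are all assumptions made for illustration.

```python
import itertools


def reconcile_once(desired_replicas, pods, name_source):
    """Run one observe/diff/act pass, mutating the `pods` set.

    Returns True if the pass changed anything, False if already converged.
    """
    diff = desired_replicas - len(pods)      # Observe and diff
    if diff > 0:
        for _ in range(diff):                # Act: create missing pods
            pods.add(next(name_source))
    elif diff < 0:
        for name in sorted(pods)[diff:]:     # Act: delete surplus pods
            pods.remove(name)
    return diff != 0


names = (f"web-{i}" for i in itertools.count())
pods = {"web-a"}
changed = reconcile_once(3, pods, names)
print(changed, sorted(pods))
```

A second pass with the same desired state finds no difference and does nothing, which is the property that makes the loop safe to run continuously.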
Graceful Degradation
A design philosophy where a system maintains limited functionality during partial failures, ensuring a basic level of service instead of a complete outage.
- Prioritizes Critical Paths: Core user journeys remain operational while non-essential features are temporarily disabled.
- Informs PDB Configuration: Helps determine the acceptable maxUnavailable percentage. For example, a read-only mode might be acceptable if 40% of pods are down, guiding a PDB setting of maxUnavailable: 40%.
This concept directly informs the business-level decisions that technical PDB configurations are meant to enforce.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.