A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods from a replicated application that must remain available during voluntary disruptions. It is a declarative policy that constrains actions like node drains, cluster autoscaler scale-downs, or manual pod evictions, ensuring high availability and service-level objectives (SLOs) are maintained during planned maintenance. The PDB does not protect against involuntary disruptions like hardware failures.
What is a Pod Disruption Budget (PDB)?
A Pod Disruption Budget (PDB) is a Kubernetes policy that limits how many agent pods can be down simultaneously during voluntary disruptions (such as node drains or cluster upgrades), ensuring high availability during maintenance.
The policy is expressed through one of two parameters: minAvailable or maxUnavailable. Administrators apply the PDB to a set of pods using label selectors. The disruption controller continuously tracks how many matching pods are healthy and records how many disruptions the budget currently allows. When an eviction is requested through the Eviction API (for example, by kubectl drain), the API server checks that figure: the eviction proceeds only if it will not violate the budget; otherwise it is rejected and the caller typically retries. This mechanism is critical for agent lifecycle management in multi-agent systems, helping preserve a quorum of operational agents during node maintenance and other orchestrated evictions.
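To make the mechanism concrete, here is a minimal sketch of the Eviction object that a client such as kubectl drain submits to a pod's eviction subresource. The pod name and namespace are hypothetical placeholders, not values from this article.

```yaml
# Eviction request for a single pod. The API server admits it only if the
# PDBs covering this pod still allow a disruption.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: inference-agent-7c9f-abcde   # hypothetical pod name
  namespace: default                 # hypothetical namespace
```

When the budget would be violated, the API server answers with 429 Too Many Requests, and the client is expected to retry later.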
Key PDB Parameters and Configuration
A Pod Disruption Budget (PDB) is a Kubernetes policy object that defines the minimum availability guarantees for a set of pods during voluntary disruptions, such as node maintenance or cluster upgrades. It is configured using a few core parameters.
Core Spec: minAvailable and maxUnavailable
A PDB is defined by one of two mutually exclusive parameters in its spec:
- minAvailable: Specifies the absolute number or percentage of pods from the controlled set that must remain available during a disruption. For example, minAvailable: 2 or minAvailable: "50%".
- maxUnavailable: Specifies the absolute number or percentage of pods from the controlled set that can be unavailable during a disruption. For example, maxUnavailable: 1 or maxUnavailable: "25%".
These parameters are evaluated against the total number of pods matched by the selector. A percentage that does not map to a whole number of pods is rounded up. You must define only one of these fields.
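As an illustration of how a percentage resolves to a pod count, the sketch below assumes seven matching pods. The object name and the pod count are assumptions made for the example.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb-percent   # hypothetical name
spec:
  # With 7 matching pods, 50% is 3.5 pods. Kubernetes rounds up,
  # so 4 pods must stay available and at most 3 can be evicted.
  minAvailable: "50%"
  selector:
    matchLabels:
      app: inference-agent
```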
Selector: Targeting Pods
The selector field is a label selector that determines which pods the PDB governs. It uses the same syntax as other Kubernetes selectors (e.g., matchLabels, matchExpressions).
- The PDB only protects pods whose labels match this selector.
- It is crucial that the selector accurately targets the pods belonging to your application's deployment, statefulset, or other controller.
- A common pattern is to use the same selector as the parent workload. For example, a PDB for a deployment with app: my-agent would use selector: { matchLabels: { app: my-agent } }.
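The selector pattern described above can be sketched as a Deployment and a PDB that share the same label. The image reference and object names are hypothetical; both objects are assumed to live in the same namespace, since a PDB only matches pods in its own namespace.

```yaml
# Deployment whose pod template carries the label app: my-agent.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-agent
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-agent
  template:
    metadata:
      labels:
        app: my-agent
    spec:
      containers:
        - name: agent
          image: registry.example.com/my-agent:1.0.0   # hypothetical image
---
# PDB reusing the same label selector, so it governs exactly those pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-agent-pdb   # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-agent
```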
Example YAML Configuration
A typical PDB manifest for an agent deployment with 4 replicas, ensuring at least 3 are always available during voluntary disruptions:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: inference-agent
```
An equivalent configuration using maxUnavailable:
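```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb   # metadata.name is required for the manifest to apply
spec:
  # Equivalent to minAvailable: 3 only while the deployment runs 4 replicas.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: inference-agent
```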
Status and Health Monitoring
Once applied, the PDB object has a status field that provides real-time information:
- currentHealthy: The number of pods currently observed as healthy and ready.
- desiredHealthy: The minimum number of pods required to be healthy, calculated from minAvailable or maxUnavailable.
- disruptionsAllowed: The most critical field. It shows how many pods can currently be disrupted without violating the budget. This value is 0 when the system is at its disruption limit.
- expectedPods: The total number of pods matched by the selector.
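As a worked example of how these fields relate, the sketch below shows the status one would expect for the agent-pdb manifest (minAvailable: 3) when all 4 replicas are healthy. The values are computed by hand for illustration, not captured from a real cluster.

```yaml
# Illustrative status stanza for a PDB with minAvailable: 3 and 4 healthy pods.
status:
  currentHealthy: 4
  desiredHealthy: 3
  disruptionsAllowed: 1   # currentHealthy - desiredHealthy
  expectedPods: 4
```

If one pod becomes unready, currentHealthy drops to 3 and disruptionsAllowed to 0, so further evictions are refused until the pod recovers.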
Platform engineers monitor disruptionsAllowed to understand the system's capacity for safe maintenance.
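One way to monitor this field is a Prometheus alerting rule. The sketch below assumes Prometheus and kube-state-metrics are deployed in the cluster, which this article does not state; the group name, alert name, duration, and severity label are illustrative choices.

```yaml
groups:
  - name: pdb-alerts   # hypothetical group name
    rules:
      - alert: PDBDisruptionsBlocked
        # Metric exported by kube-state-metrics for each PDB.
        expr: kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
        for: 30m   # illustrative threshold for "extended periods"
        labels:
          severity: warning
        annotations:
          summary: "PDB {{ $labels.poddisruptionbudget }} in {{ $labels.namespace }} allows no disruptions"
```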
Interaction with Voluntary vs. Involuntary Disruptions
PDBs only govern voluntary disruptions, and only those carried out through the Eviction API. It is critical to understand the distinction:
- Voluntary Disruptions: Actions initiated by a cluster administrator or automated process that are expected and controlled. Examples include:
  - Draining a node for maintenance (kubectl drain).
  - A deployment update triggering a rolling update.
  - Manually deleting a pod.
  Drains and other eviction-based operations respect the PDB and are blocked if they would violate the budget. Rolling updates and direct pod deletions are not limited by the PDB, although the pods they make unavailable still count against it.
- Involuntary Disruptions: Unplanned failures. Examples include:
  - A node hardware failure.
  - A pod eviction due to the node running out of resources.
  - A kernel panic.
  PDBs do not protect against these. Resilience here is provided by replica counts, self-healing, and anti-affinity rules.
Best Practices for Agent Systems
When configuring PDBs for agent orchestration:
- Set Realistic Budgets: For a deployment with replicas: 4, maxUnavailable: 1 (25%) is common. For critical agents, prefer a high minAvailable percentage; maxUnavailable: 0 blocks every eviction, so a node drain cannot complete until the budget is relaxed.
- Align with Replica Count: Ensure your PDB allows the orchestrator to make progress. A PDB with minAvailable: 100% on a deployment prevents all voluntary evictions, which can block necessary node maintenance.
- Use with Anti-Affinity: Combine PDBs with pod anti-affinity rules to ensure agent pods are spread across nodes. This keeps a single node failure from taking down several pods at once and lets a drain proceed instead of stalling on the budget.
- Monitor Budget Violations: Use orchestration observability tools to alert when disruptionsAllowed is 0 for extended periods, indicating a potential operational blocker.
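The anti-affinity recommendation above can be sketched as follows. This is one possible configuration, not a prescribed one: the image reference is hypothetical, and the soft ("preferred") rule is a design choice that keeps pods schedulable when there are fewer nodes than replicas.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-agent
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference-agent
  template:
    metadata:
      labels:
        app: inference-agent
    spec:
      affinity:
        podAntiAffinity:
          # Prefer placing each replica on a different node.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: inference-agent
      containers:
        - name: agent
          image: registry.example.com/inference-agent:1.0.0   # hypothetical image
```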
Frequently Asked Questions
A Pod Disruption Budget (PDB) is a critical Kubernetes policy for managing the availability of agent pods during voluntary disruptions. These FAQs address its core mechanisms, configuration, and role in multi-agent system orchestration.
A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods in an application that must remain available during voluntary disruptions. It works by placing constraints on evictions initiated by cluster administrators or automated system components, such as draining a node for maintenance or a cluster autoscaler scale-down. When an eviction is requested, the Kubernetes API server checks the relevant PDBs. The eviction is allowed to proceed only if it will not violate the PDB's stated availability guarantees (e.g., maxUnavailable: 1). If it would cause too many pods to be down simultaneously, it is rejected and retried later, which effectively paces the operation and preserves availability for stateful agents or critical services during planned events.
Related Terms
These core orchestration concepts work in concert with Pod Disruption Budgets to manage the availability, resilience, and operational lifecycle of agent pods in production.
Agent Health Check
A periodic diagnostic probe used by an orchestration system to determine if an agent is functioning correctly. A failing health check can trigger a self-healing action, such as a container restart. This is distinct from a PDB, which governs voluntary disruptions. Key types include:
- Liveness Probe: Determines if the agent container is still healthy. Failure results in a container restart.
- Readiness Probe: Determines if the agent is ready to accept traffic. Failure removes the pod from service load balancers.
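A minimal sketch of both probe types on an agent container follows. The endpoints, port, timings, and image are hypothetical and would need to match what the agent actually exposes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-agent-example   # hypothetical name
  labels:
    app: inference-agent
spec:
  containers:
    - name: agent
      image: registry.example.com/inference-agent:1.0.0   # hypothetical image
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /healthz   # hypothetical endpoint
          port: 8080
        periodSeconds: 10
        failureThreshold: 3   # restart the container after 3 consecutive failures
      readinessProbe:
        httpGet:
          path: /ready     # hypothetical endpoint
          port: 8080
        periodSeconds: 5
```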
Agent Self-Healing
An orchestration capability where the system automatically detects and recovers from agent failures. This typically works in tandem with health checks and PDBs:
- Self-healing handles involuntary disruptions (crashes, node failures) by restarting or rescheduling pods.
- A Pod Disruption Budget (PDB) protects against voluntary disruptions (node maintenance and upgrades) by limiting how many pods can be evicted at once.
Together, they ensure high availability across both planned and unplanned downtime scenarios.
Agent Rolling Update
A deployment strategy that incrementally replaces instances of an old agent version with a new version. Pod Disruption Budgets matter here mainly when a rollout coincides with node maintenance.
- The orchestrator (e.g., Kubernetes) terminates old pods and creates new ones in a controlled sequence.
- For a Deployment, the pace is set by the workload's own rolling update settings (e.g., maxUnavailable: 1 in the update strategy), not by the PDB. Pods that are unavailable during the rollout still count against the budget, so concurrent evictions are held back until replacements are ready.
Configured together, these settings help achieve zero-downtime updates while maintaining service-level agreements.
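For a Kubernetes Deployment, the pace of a rolling update is configured in the workload's strategy block. The sketch below shows one such configuration; the replica count, surge value, and image are illustrative assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-agent
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod below the desired count during the rollout
      maxSurge: 1         # at most one extra pod above the desired count
  selector:
    matchLabels:
      app: inference-agent
  template:
    metadata:
      labels:
        app: inference-agent
    spec:
      containers:
        - name: agent
          image: registry.example.com/inference-agent:1.1.0   # hypothetical image
```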
Agent Graceful Termination
The controlled shutdown process for an agent, allowing it to complete in-flight tasks and release resources. When a voluntary disruption (like a node drain) occurs, the orchestrator sends a SIGTERM signal to initiate this process.
- A Pod Disruption Budget influences the timing of these terminations by limiting how many can happen concurrently.
- The agent has a terminationGracePeriodSeconds to finish its work before receiving a SIGKILL.
This process is critical for preventing data corruption and ensuring clean handoffs.
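A minimal sketch of a pod spec that gives an agent time to shut down cleanly is shown below. The 60-second grace period, the preStop delay, and the image are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-agent-example   # hypothetical name
spec:
  # Total time allowed for the preStop hook plus SIGTERM handling before SIGKILL.
  terminationGracePeriodSeconds: 60
  containers:
    - name: agent
      image: registry.example.com/inference-agent:1.0.0   # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent; a short pause lets load balancers
            # stop routing new work to this pod.
            command: ["/bin/sh", "-c", "sleep 5"]
```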
Node Drain / Cordon
Administrative operations that prepare a node for maintenance, directly interacting with Pod Disruption Budgets.
- Cordon: Marks a node as unschedulable, preventing new pods from being placed on it.
- Drain: Safely evicts all pods from a node, respecting each pod's PDB. The drain command will block if evicting a pod would violate its PDB.
These commands are used during node updates, scaling down, or hardware repairs, making PDBs essential for planned cluster operations.
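For reference, cordoning a node amounts to setting a single field on the Node object. The fragment below sketches the relevant part of a cordoned node; the node name is hypothetical.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1   # hypothetical node name
spec:
  unschedulable: true   # set by kubectl cordon, cleared by kubectl uncordon
```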
Agent Quality of Service (QoS)
A classification (Guaranteed, Burstable, BestEffort) assigned by an orchestrator based on resource requests and limits. While PDBs protect against voluntary disruptions, QoS influences behavior during involuntary resource pressure:
- Under memory pressure, the system may evict pods to reclaim resources.
- BestEffort pods (with no requests/limits) are generally evicted first, followed by Burstable, then Guaranteed.
Understanding QoS is crucial for designing systems where PDBs alone are insufficient for overall resilience.
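As a sketch of how the QoS class follows from the pod spec, the example below sets requests equal to limits for both CPU and memory on its only container, which places the pod in the Guaranteed class. The resource sizes and image are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-agent-guaranteed   # hypothetical name
spec:
  containers:
    - name: agent
      image: registry.example.com/inference-agent:1.0.0   # hypothetical image
      resources:
        # Requests equal to limits for every container yields the Guaranteed class.
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
```

Setting requests lower than limits would make the pod Burstable, and omitting both would make it BestEffort.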

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.