Glossary

Pod Disruption Budget (PDB)

A Pod Disruption Budget (PDB) is a Kubernetes policy that limits the number of agent pods that can be down simultaneously during voluntary disruptions like node drains or updates.

KUBERNETES POLICY

What is Pod Disruption Budget (PDB)?

A Pod Disruption Budget (PDB) is a Kubernetes policy that limits how many agent pods can be down at the same time during a voluntary disruption, such as a node drain or cluster upgrade, so the service stays available during maintenance.

A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods from a replicated application that must remain available during voluntary disruptions. It is a declarative policy that constrains actions like node drains, cluster autoscaler scale-downs, or manual pod evictions, ensuring high availability and service-level objectives (SLOs) are maintained during planned maintenance. The PDB does not protect against involuntary disruptions like hardware failures.

The budget is expressed with one of two fields: minAvailable or maxUnavailable. Administrators apply the PDB to a set of pods using label selectors. When a disruptive operation requests a pod eviction, Kubernetes checks the PDB, whose status is kept current by the disruption controller. The eviction proceeds only if it will not violate the budget; otherwise it is refused and the caller must retry later. This mechanism is critical for agent lifecycle management in multi-agent systems, preserving a quorum of operational agents during node maintenance and cluster upgrades.

POD DISRUPTION BUDGET

Key PDB Parameters and Configuration

A Pod Disruption Budget (PDB) is a Kubernetes policy object that defines the minimum availability guarantees for a set of pods during voluntary disruptions, such as node maintenance or cluster upgrades. It is configured using a few core parameters.

Core Spec: minAvailable and maxUnavailable

A PDB is defined by one of two mutually exclusive parameters in its spec:

minAvailable: Specifies the absolute number or percentage of pods from the controlled set that must remain available during a disruption. For example, minAvailable: 2 or minAvailable: "50%".
maxUnavailable: Specifies the absolute number or percentage of pods from the controlled set that can be unavailable during a disruption. For example, maxUnavailable: 1 or maxUnavailable: "25%".

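Percentages are resolved against the expected number of pods, which Kubernetes derives from the replica count of the workload that owns the selected pods, and fractional results are rounded up. A single PDB can set only one of the two fields.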
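
As a rough illustration of how a percentage resolves, the sketch below assumes a hypothetical workload with 7 replicas labelled app: inference-agent. The object name is made up for the example. Kubernetes rounds a fractional result up, so 50% of 7 pods (3.5) means 4 pods must stay available.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb-percent   # hypothetical name
spec:
  # Assumption: the owning workload runs 7 replicas.
  # 50% of 7 = 3.5, rounded up to 4 pods that must remain available.
  minAvailable: "50%"
  selector:
    matchLabels:
      app: inference-agent
```

The same rounding applies to a maxUnavailable percentage, so the number of pods that may be disrupted can slightly exceed the stated percentage.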

Selector: Targeting Pods

The selector field is a label selector that determines which pods the PDB governs. It uses the same syntax as other Kubernetes selectors (e.g., matchLabels, matchExpressions).

The PDB only protects pods whose labels match this selector.
It is crucial that the selector accurately targets the pods belonging to your application's deployment, statefulset, or other controller.
A common pattern is to use the same selector as the parent workload. For example, a PDB for a deployment with app: my-agent would use selector: { matchLabels: { app: my-agent } }.
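
When one budget has to cover more than one label value, matchExpressions can be used instead of matchLabels. The following is a minimal sketch. The object name, the planner-agent value, and the tier label are hypothetical and only illustrate the syntax.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: multi-agent-pdb   # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchExpressions:
      # Pods whose app label is one of the listed values...
      - key: app
        operator: In
        values:
          - inference-agent
          - planner-agent   # hypothetical second workload
      # ...and that carry a tier label with any value.
      - key: tier
        operator: Exists
```

All expressions must match for a pod to be selected. In policy/v1, an empty selector ({}) matches every pod in the namespace, so it is worth double-checking the selector before applying.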

Example YAML Configuration

A typical PDB manifest for an agent deployment with 4 replicas, ensuring at least 3 are always available during voluntary disruptions:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: inference-agent

An equivalent configuration for the same 4-replica deployment, using maxUnavailable:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: inference-agent

Status and Health Monitoring

Once applied, the PDB object has a status field that provides real-time information:

currentHealthy: The number of pods currently observed as healthy and ready.
desiredHealthy: The minimum number of pods required to be healthy, calculated from minAvailable or maxUnavailable.
disruptionsAllowed: The most critical field. It shows how many pods can currently be disrupted without violating the budget. This value is 0 when the system is at its disruption limit.
expectedPods: The total number of pods matched by the selector.

Platform engineers monitor disruptionsAllowed to understand the system's capacity for safe maintenance.
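
For the 4-replica example with minAvailable: 3 and all pods ready, the status section returned by kubectl get pdb agent-pdb -o yaml would look roughly like this. The values are illustrative and other status fields are omitted.

```yaml
status:
  expectedPods: 4        # pods the budget expects to exist
  currentHealthy: 4      # pods currently ready
  desiredHealthy: 3      # derived from minAvailable: 3
  disruptionsAllowed: 1  # currentHealthy - desiredHealthy, never below 0
```

If one pod becomes unready, currentHealthy drops to 3 and disruptionsAllowed to 0, so further evictions are refused until the pod recovers.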

Interaction with Voluntary vs. Involuntary Disruptions

PDBs only govern voluntary disruptions, and only those that go through the Eviction API. It is critical to understand the distinction:

Voluntary Disruptions: Actions initiated by a cluster administrator or automated process that are expected and controlled. Examples include:
- Draining a node for maintenance (kubectl drain).
- The cluster autoscaler removing an underused node.
- Evicting a pod through the Eviction API.
Tools that use the Eviction API respect the PDB and are refused if the eviction would violate the budget. Deleting a pod directly, or rolling out a Deployment update, bypasses the PDB, although pods made unavailable this way still count against the budget.
Involuntary Disruptions: Unplanned failures. Examples include:
- A node hardware failure.
- A pod eviction due to the node running out of resources.
- A kernel panic.
PDBs do not protect against these. Resilience here is provided by replica counts, self-healing, and anti-affinity rules.
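
Tools such as kubectl drain create an Eviction object for each pod rather than deleting it directly, and that request is what the budget is checked against. Below is a minimal sketch of such a request body, shown in YAML to match the manifests here (clients commonly send it as JSON). The pod name and namespace are hypothetical.

```yaml
apiVersion: policy/v1
kind: Eviction
metadata:
  name: inference-agent-abc12   # hypothetical pod name
  namespace: default            # assumed namespace
```

The object is submitted to the pod's eviction subresource. If the eviction would violate the budget, the API server responds with 429 Too Many Requests and the caller is expected to retry later.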

Best Practices for Agent Systems

When configuring PDBs for agent orchestration:

Set Realistic Budgets: For a deployment with replicas: 4, maxUnavailable: 1 (25%) is common. For critical agents, prefer a higher replica count with a high minAvailable over maxUnavailable: 0, which blocks every eviction.
Align with Replica Count: Ensure your PDB allows the orchestrator to make progress. A PDB with minAvailable: "100%" (or maxUnavailable: 0) refuses all evictions, so node drains and cluster upgrades stall until the budget is relaxed.
Use with Anti-Affinity: Combine PDBs with pod anti-affinity rules so agent pods are spread across nodes. If several pods share one node, draining it has to evict them one at a time as the budget allows, which slows or stalls maintenance.
Monitor Budget Violations: Use orchestration observability tools to alert when disruptionsAllowed stays at 0 for extended periods, indicating a potential operational blocker.
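
The anti-affinity practice could be sketched as follows for the 4-replica inference-agent deployment. This is one possible configuration, not the only one. The image reference is a placeholder, and the preferred (soft) rule is an assumption chosen so that scheduling still succeeds on small clusters.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-agent
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference-agent
  template:
    metadata:
      labels:
        app: inference-agent
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer nodes that do not already run an agent pod.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: inference-agent
                topologyKey: kubernetes.io/hostname
      containers:
        - name: agent
          image: registry.example.com/inference-agent:1.0.0   # placeholder image
```

With pods spread one per node, draining a single node needs only one eviction, which fits within a budget of maxUnavailable: 1.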

POD DISRUPTION BUDGET

Frequently Asked Questions

A Pod Disruption Budget (PDB) is a critical Kubernetes policy for managing the availability of agent pods during voluntary disruptions. These FAQs address its core mechanisms, configuration, and role in multi-agent system orchestration.

A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods in an application that must remain available during voluntary disruptions. It works by constraining evictions requested by cluster administrators or automated system components, such as draining a node for maintenance or scaling down a node with the cluster autoscaler. When an eviction is requested, the Kubernetes API server checks the relevant PDBs. The eviction is allowed only if it will not violate the PDB's stated availability guarantee (e.g., maxUnavailable: 1). If it would cause too many pods to be down simultaneously, it is refused until enough pods are healthy again, which paces the operation and preserves availability for stateful agents or critical services during planned events.

AGENT LIFECYCLE MANAGEMENT

Related Terms

These core orchestration concepts work in concert with Pod Disruption Budgets to manage the availability, resilience, and operational lifecycle of agent pods in production.

Agent Health Check

A periodic diagnostic probe used by an orchestration system to determine if an agent is functioning correctly. A failing health check can trigger a self-healing action, such as a pod restart. This is distinct from a PDB, which governs voluntary disruptions. Key types include:

Liveness Probe: Determines if the agent container is running. Failure results in a restart.
Readiness Probe: Determines if the agent is ready to accept traffic. Failure removes the pod from service load balancers.
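
A minimal sketch of both probes on an agent container is shown below. The /healthz and /readyz paths, port 8080, and the image reference are assumptions. The agent would need to expose such endpoints itself.

```yaml
containers:
  - name: agent
    image: registry.example.com/inference-agent:1.0.0   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz   # hypothetical endpoint
        port: 8080
      periodSeconds: 10
      failureThreshold: 3   # restart after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /readyz    # hypothetical endpoint
        port: 8080
      periodSeconds: 5
```

Readiness also feeds into the PDB: only ready pods count toward currentHealthy, so a pod failing its readiness probe reduces disruptionsAllowed.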

Agent Self-Healing

An orchestration capability where the system automatically detects and recovers from agent failures. This typically works in tandem with health checks and PDBs:

Self-healing handles involuntary disruptions (crashes, node failures) by restarting or rescheduling pods.
A Pod Disruption Budget (PDB) protects against voluntary disruptions (maintenance, updates) by limiting how many pods can be down at once.

Together, they ensure high availability across both planned and unplanned downtime scenarios.

Agent Rolling Update

A deployment strategy that incrementally replaces instances of an old agent version with a new version. It is often mistaken for a Pod Disruption Budget use case, but the two are configured separately.

The orchestrator (e.g., Kubernetes) terminates old pods and creates new ones in a controlled sequence.
The pace is set by the workload's own update strategy (for a Deployment, strategy.rollingUpdate.maxUnavailable and maxSurge), not by the PDB. Pods that are unavailable because of a rollout still count against the PDB, so a node drain during an update may be refused until the rollout settles.
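
In a Kubernetes Deployment, the pace of a rolling update is set in the Deployment's own strategy block rather than in a PDB. The fragment below is a sketch for a 4-replica deployment. The specific values are assumptions chosen to mirror a budget of one unavailable pod.

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 pod below the desired count during the rollout
      maxSurge: 1         # at most 1 extra pod above the desired count
```

Keeping the rollout's maxUnavailable no larger than what the PDB tolerates helps avoid a state where the rollout itself uses up the budget and blocks concurrent node drains.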

Agent Graceful Termination

The controlled shutdown process for an agent, allowing it to complete in-flight tasks and release resources. When a voluntary disruption (like a node drain) occurs, the orchestrator sends a SIGTERM signal to initiate this process.

A Pod Disruption Budget influences the timing of these terminations by limiting how many can happen concurrently.
The agent has up to terminationGracePeriodSeconds (30 seconds by default) to finish its work before receiving a SIGKILL.

This process is critical for preventing data corruption and ensuring clean handoffs.
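
A sketch of how the grace period might be configured in a pod template follows. The 60-second value, the preStop delay, and the image reference are assumptions, and the preStop command presumes the image ships /bin/sh.

```yaml
spec:
  # Time allowed between SIGTERM and SIGKILL; includes the preStop hook.
  terminationGracePeriodSeconds: 60
  containers:
    - name: agent
      image: registry.example.com/inference-agent:1.0.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Short pause before SIGTERM is sent so load balancers can stop routing new work.
            command: ["/bin/sh", "-c", "sleep 5"]
```

The agent process still needs to handle SIGTERM itself, for example by finishing in-flight tasks and refusing new ones.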

Node Drain / Cordon

Administrative operations that prepare a node for maintenance, directly interacting with Pod Disruption Budgets.

Cordon: Marks a node as unschedulable, preventing new pods from being placed on it.
Drain: Cordons the node, then safely evicts its pods, respecting each pod's PDB. The drain command waits and retries if evicting a pod would violate its PDB.

These commands are used during node updates, scaling down, or hardware repairs, making PDBs essential for planned cluster operations.

Agent Quality of Service (QoS)

A classification (Guaranteed, Burstable, BestEffort) assigned by an orchestrator based on resource requests and limits. While PDBs protect against voluntary disruptions, QoS influences behavior during involuntary resource pressure:

Under memory pressure, the system may evict pods to reclaim resources.
In practice, BestEffort pods (with no requests/limits) tend to be evicted first, followed by Burstable pods that exceed their requests, with Guaranteed pods last.

Understanding QoS is crucial for designing systems where PDBs alone are insufficient for overall resilience.
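
As an illustration, a container is placed in the Guaranteed class when every container in the pod sets CPU and memory requests equal to its limits. The resource values and image reference below are placeholders.

```yaml
containers:
  - name: agent
    image: registry.example.com/inference-agent:1.0.0   # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"      # equal to the request
        memory: "512Mi"  # equal to the request
```

Setting requests lower than limits would make the pod Burstable, and omitting both for every container would make it BestEffort.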

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
