Inferensys

Pod Disruption Budget (PDB)

A Pod Disruption Budget (PDB) is a Kubernetes policy that limits the number of agent pods that can be down simultaneously during voluntary disruptions like node drains or updates.
KUBERNETES POLICY

What is Pod Disruption Budget (PDB)?

A Pod Disruption Budget (PDB) is a Kubernetes policy that limits how many agent pods can be down at the same time during voluntary disruptions, such as node drains or cluster upgrades, so that the application stays available during maintenance.

A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods from a replicated application that must remain available during voluntary disruptions. It is a declarative policy that constrains actions like node drains, cluster autoscaler scale-downs, or manual pod evictions, ensuring high availability and service-level objectives (SLOs) are maintained during planned maintenance. The PDB does not protect against involuntary disruptions like hardware failures.

The budget is expressed with one of two fields, minAvailable or maxUnavailable, and administrators attach the PDB to a set of pods using a label selector. The Kubernetes disruption controller continuously tracks how many matching pods are healthy and records how many disruptions the budget currently allows. When a tool such as kubectl drain asks to evict a pod through the Eviction API, the API server checks that allowance: the eviction proceeds only if it will not violate the budget, and is otherwise rejected so the caller can retry later. This mechanism is critical for agent lifecycle management in multi-agent systems, because it preserves a quorum of operational agents during node maintenance and other orchestrated evictions.
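
To make the enforcement step concrete, the request issued for each pod during a drain can be sketched as an Eviction object. This is a minimal sketch: the pod name and namespace are hypothetical, and the object is sent as the body of a POST to the pod's eviction subresource rather than applied as a regular manifest.

```yaml
# Body of an API-initiated eviction request.
# POSTed to: /api/v1/namespaces/default/pods/inference-agent-abc12/eviction
# The pod name and namespace are hypothetical placeholders.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: inference-agent-abc12   # the pod to evict
  namespace: default
```

If granting the eviction would break a matching PDB, the API server answers with HTTP 429 (Too Many Requests) and the pod keeps running. kubectl drain retries such evictions until they succeed or until its timeout, if one is set, expires.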

POD DISRUPTION BUDGET

Key PDB Parameters and Configuration

A Pod Disruption Budget (PDB) is a Kubernetes policy object that defines the minimum availability guarantees for a set of pods during voluntary disruptions, such as node maintenance or cluster upgrades. It is configured using a few core parameters.

01

Core Spec: minAvailable and maxUnavailable

A PDB is defined by one of two mutually exclusive parameters in its spec:

  • minAvailable: Specifies the absolute number or percentage of pods from the controlled set that must remain available during a disruption. For example, minAvailable: 2 or minAvailable: "50%".
  • maxUnavailable: Specifies the absolute number or percentage of pods from the controlled set that can be unavailable during a disruption. For example, maxUnavailable: 1 or maxUnavailable: "25%".

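An absolute number is compared directly with the count of healthy pods. A percentage is converted to a pod count from the expected number of pods in the set and is rounded up to the nearest whole pod. You must define only one of these fields.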
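
As a worked illustration of the percentage form, the sketch below assumes a hypothetical workload with 7 replicas labeled app: inference-agent. Kubernetes rounds a percentage up to a whole pod, so 50% of 7 becomes 4 pods that must stay available.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb-percent   # hypothetical name
spec:
  # With 7 expected pods: 50% of 7 = 3.5, rounded up to 4.
  # At least 4 pods must stay available, so at most 3 can be
  # evicted at once when all 7 are healthy.
  minAvailable: "50%"
  selector:
    matchLabels:
      app: inference-agent
```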

02

Selector: Targeting Pods

The selector field is a label selector that determines which pods the PDB governs. It uses the same syntax as other Kubernetes selectors (e.g., matchLabels, matchExpressions).

  • The PDB only protects pods whose labels match this selector.
  • It is crucial that the selector accurately targets the pods belonging to your application's deployment, statefulset, or other controller.
  • A common pattern is to use the same selector as the parent workload. For example, a PDB for a deployment with app: my-agent would use selector: { matchLabels: { app: my-agent } }.
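
The shared-selector pattern can be sketched as a Deployment and a PDB in the same namespace. The names and the image reference are hypothetical placeholders; the point is only that the PDB selector matches the labels on the pod template.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-agent
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-agent
  template:
    metadata:
      labels:
        app: my-agent          # label the PDB selector matches
    spec:
      containers:
        - name: agent
          image: registry.example.com/my-agent:1.0   # hypothetical image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-agent-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-agent            # must match the pod template labels above
```

A PDB only considers pods in its own namespace, so both objects need to be created in the same one.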

03

Example YAML Configuration

A typical PDB manifest for an agent deployment with 4 replicas, ensuring at least 3 are always available during voluntary disruptions:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: inference-agent

With 4 replicas, the same budget can be expressed using maxUnavailable:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: inference-agent

04

Status and Health Monitoring

Once applied, the PDB object has a status field that provides real-time information:

  • currentHealthy: The number of pods currently observed as healthy and ready.
  • desiredHealthy: The minimum number of pods required to be healthy, calculated from minAvailable or maxUnavailable.
  • disruptionsAllowed: The most critical field. It shows how many pods can currently be disrupted without violating the budget. This value is 0 when the system is at its disruption limit.
  • expectedPods: The total number of pods matched by the selector.

Platform engineers monitor disruptionsAllowed to understand the system's capacity for safe maintenance.
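
For illustration, the status of the 4-replica agent-pdb example with minAvailable: 3 might look like the following sketch when all pods are healthy. The values are assumed for the example, not captured from a real cluster; the field names are those of the policy/v1 PodDisruptionBudget status.

```yaml
# Illustrative status section of: kubectl get pdb agent-pdb -o yaml
status:
  currentHealthy: 4       # all 4 replicas are ready
  desiredHealthy: 3       # derived from minAvailable: 3
  disruptionsAllowed: 1   # currentHealthy - desiredHealthy
  expectedPods: 4
```

After one pod is evicted, and until its replacement becomes ready, currentHealthy drops to 3 and disruptionsAllowed to 0, so further evictions are refused.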

05

Interaction with Voluntary vs. Involuntary Disruptions

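PDBs apply only to voluntary disruptions, and among those only to evictions requested through the Eviction API. It is critical to understand the distinctions:

  • Voluntary Disruptions: Actions initiated by a cluster administrator or automated process that are expected and controlled. Examples include:

    • Draining a node for maintenance (kubectl drain).
    • A deployment update triggering a rolling update.
    • Manually deleting a pod.

    Evictions issued through the Eviction API, as kubectl drain does, respect the PDB and are refused if they would violate the budget. Rolling updates and direct pod deletions are not blocked by a PDB: a rolling update is paced by the workload's own update strategy, and deleting a pod bypasses the budget entirely, although the pods they take down still count against it.
  • Involuntary Disruptions: Unplanned failures. Examples include:

    • A node hardware failure.
    • A pod eviction due to the node running out of resources.
    • A kernel panic.

    PDBs cannot prevent these, but the pods they take down still count against the budget. Resilience here is provided by replica counts, self-healing, and anti-affinity rules.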

06

Best Practices for Agent Systems

When configuring PDBs for agent orchestration:

  • Set Realistic Budgets: For a deployment with replicas: 4, maxUnavailable: 1 (25%) is common. For critical agents, use a high minAvailable percentage; reserve maxUnavailable: 0 for cases where every eviction should require manual intervention, because it blocks all node drains.
  • Align with Replica Count: Ensure your PDB allows the orchestrator to make progress. A PDB with minAvailable: 100%, or with minAvailable equal to the replica count, forbids every voluntary eviction, which can stall node drains and cluster upgrades indefinitely.
  • Use with Anti-Affinity: Combine PDBs with pod anti-affinity rules so that agent pods are spread across nodes. A node drain then has to evict only one agent pod per node, so it is less likely to stall waiting for budget, and a single node failure cannot take down several agents at once.
  • Monitor Exhausted Budgets: Use orchestration observability tools to alert when disruptionsAllowed is 0 for extended periods, indicating a potential operational blocker.
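
The anti-affinity recommendation can be sketched as follows. This is one possible configuration, not the only one: it uses a soft (preferred) rule so that scheduling still succeeds when there are fewer nodes than replicas, and the Deployment name and image reference are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-agent
spec:
  replicas: 4
  selector:
    matchLabels:
      app: inference-agent
  template:
    metadata:
      labels:
        app: inference-agent
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer nodes that do not already run an agent pod.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: inference-agent
                topologyKey: kubernetes.io/hostname
      containers:
        - name: agent
          image: registry.example.com/inference-agent:1.0   # hypothetical image
```

Switching to requiredDuringSchedulingIgnoredDuringExecution would enforce one agent pod per node, at the cost of leaving pods unschedulable when no free node exists.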

Frequently Asked Questions

A Pod Disruption Budget (PDB) is a critical Kubernetes policy for managing the availability of agent pods during voluntary disruptions. These FAQs address its core mechanisms, configuration, and role in multi-agent system orchestration.

A Pod Disruption Budget (PDB) is a Kubernetes API object that specifies the minimum number or percentage of pods in an application that must remain available during voluntary disruptions. It works by constraining evictions requested by cluster administrators or automated system components, such as draining a node for maintenance or removing nodes during a cluster autoscaler scale-down. When an eviction is requested, the Kubernetes API server checks the relevant PDBs. The eviction is allowed to proceed only if it will not violate the PDB's stated availability guarantees (e.g., maxUnavailable: 1). If it would cause too many pods to be down simultaneously, it is rejected and the caller retries later, which paces the operation and preserves high availability for stateful agents or critical services during planned events.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.