Agent Quality of Service (QoS) is a classification (Guaranteed, Burstable, or BestEffort) that an orchestrator such as Kubernetes assigns based on an agent's declared resource requests and limits. The class determines how the agent is treated when a node comes under resource pressure, most visibly its place in the eviction order, while the requests behind it determine where the agent can be scheduled. Together they help high-priority agents get the compute resources they need to run reliably.
Agent Quality of Service (QoS)

What is Agent Quality of Service (QoS)?
Agent Quality of Service (QoS) is a classification system that orchestration platforms use to manage resource guarantees and eviction order for autonomous agents, supporting predictable performance and system stability.
In practice, the Guaranteed QoS class is assigned to agents whose CPU and memory requests equal their limits, so the resources they request are reserved for them, although CPU use is still capped at the limit. Burstable agents have requests set lower than limits, allowing them to use excess resources when available. BestEffort agents, with no requests or limits, have no reserved resources and are the first to be terminated under memory pressure. This makes QoS a fundamental tool for platform engineers managing agent lifecycle and cluster stability.
The Three QoS Classes
In Kubernetes, a Pod's Quality of Service (QoS) class is assigned automatically from its resource requests and limits. The class directly affects the order of eviction when node resources are exhausted, while the requests themselves drive scheduling.
Guaranteed QoS
The most protected class, assigned to Pods in which every container defines a CPU request, a CPU limit, a memory request, and a memory limit, with the request equal to the limit for each resource.
- Scheduling: The scheduler uses the request as the minimum resource guarantee.
- Eviction: These Pods are the last to be killed under memory pressure.
- Use Case: Critical, latency-sensitive agents where predictable performance is essential.
Burstable QoS
The intermediate class, assigned when a Pod does not meet the criteria for Guaranteed QoS but at least one of its containers sets a CPU or memory request or limit.
- Scheduling: Uses the defined request for scheduling.
- Eviction: Killed after all BestEffort Pods but before any Guaranteed Pods when the node is under memory pressure.
- Use Case: General-purpose agents that need a baseline of resources but can temporarily use more (burst) if available.
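As an illustration of the Burstable criteria, the sketch below shows a two-container Pod in which only one container declares resources. Because the Pod sets some requests but does not satisfy the Guaranteed rule for every container, Kubernetes would classify it as Burstable. The Pod name, container names, and images are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: planner-agent                     # hypothetical name
spec:
  containers:
    - name: agent                         # hypothetical container
      image: example.com/agent:1.0        # placeholder image
      resources:
        requests:
          memory: "128Mi"
          cpu: "250m"
        limits:
          memory: "256Mi"                 # limit above request
    - name: log-shipper                   # sidecar with no requests or limits
      image: example.com/log-shipper:1.0  # placeholder image
```

Giving the sidecar requests and limits equal to each other, and doing the same for the main container, would be one way to move this Pod into the Guaranteed class.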
BestEffort QoS
The lowest priority class, assigned to Pods where no container has any CPU or memory requests or limits specified.
- Scheduling: Has no resource guarantees; scheduled onto nodes with allocatable space.
- Eviction: These Pods are the first to be terminated when the node experiences memory pressure.
- Use Case: Non-critical, batch-oriented, or test agents where interruption is acceptable.
How QoS Affects Scheduling
The scheduler uses the resource request (not the limit) to determine if a node has enough capacity to run a Pod.
- A Guaranteed Pod with a 1 CPU request requires a node with at least 1 CPU of allocatable capacity not already claimed by other Pods' requests.
- A BestEffort Pod, with no request, can be scheduled onto any node that accepts it but competes for resources without protection.
- Nodes use a QoS cgroup hierarchy to enforce these classes. Under CPU contention, CPU time is divided in proportion to requests, so BestEffort Pods receive very little.
Eviction Order Under Memory Pressure
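When available memory on a node falls below the kubelet's eviction threshold, the kubelet evicts Pods to reclaim resources. It ranks Pods by whether their usage exceeds their requests, by Pod priority, and by how far usage exceeds requests, which in practice produces the following order:
- BestEffort Pods are killed first, since they have no requests and any usage counts as exceeding them.
- Burstable Pods using more than their requests are killed next, starting with those consuming the most relative to their request.
- Guaranteed Pods, along with Burstable Pods that stay within their requests, are killed last, typically only after the memory held by lower-ranked Pods has been reclaimed.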
This ensures critical agent workloads have the highest survivability.
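The point at which the kubelet starts evicting is configurable. The fragment below is a minimal sketch of a kubelet configuration file that sets hard and soft memory eviction thresholds; the specific values are illustrative assumptions, not recommendations.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"   # evict immediately once free memory drops below this
evictionSoft:
  memory.available: "500Mi"   # evict only if the condition persists
evictionSoftGracePeriod:
  memory.available: "1m30s"   # how long the soft condition must hold
```

A soft threshold set above the hard one gives lower-ranked agents a window to be evicted gracefully before the hard threshold forces immediate termination.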
Practical Configuration Examples
Guaranteed QoS Pod Spec:
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```
Burstable QoS Pod Spec:
```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # Limit > Request
    cpu: "1"
```
BestEffort QoS Pod Spec:
```yaml
# No resources section defined
```
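Kubernetes records the class it derived in the Pod's status, which makes it possible to confirm the result of a given spec. The excerpt below sketches what that part of the output might look like for a Pod using the Guaranteed spec above; the surrounding fields are omitted.

```yaml
# Excerpt of `kubectl get pod <pod-name> -o yaml`
status:
  phase: Running
  qosClass: Guaranteed   # derived by Kubernetes, not set by the user
```

The qosClass field is read-only: changing the class means changing the requests and limits in the spec.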
How Agent QoS Works in Orchestration
Agent Quality of Service (QoS) is a classification system used by orchestrators to manage resource guarantees and eviction order for agent workloads, directly impacting system stability and performance under contention.
Agent Quality of Service (QoS) is a resource management classification, typically Guaranteed, Burstable, or BestEffort, assigned by an orchestrator like Kubernetes based on an agent's declared CPU and memory requests and limits. The classification determines the order in which agents may be evicted from a node when the system experiences resource pressure, helping critical agents stay in operation. Placement is decided from the requests themselves, and preemption between pending and running pods is governed by Pod priority rather than by QoS class.
In practice, an agent with Guaranteed QoS (equal requests and limits) receives the strongest protection and is last to be evicted, making it suitable for mission-critical, latency-sensitive tasks. Agents with Burstable (limits higher than requests) or BestEffort (no requests or limits) classifications can use spare capacity opportunistically and are terminated earlier during resource contention, which suits batch or background processing. This tiered system allows platform engineers to architect cost-effective, resilient multi-agent systems by aligning an agent's business importance with its resource guarantees.
Frequently Asked Questions
Agent Quality of Service (QoS) is a critical concept in multi-agent orchestration that determines resource allocation, scheduling priority, and system stability. These FAQs address its mechanisms, implementation, and impact on agent lifecycle management.
Agent Quality of Service (QoS) is a classification system used by an orchestrator, such as Kubernetes, to manage the resource guarantees and eviction order of agent pods based on their declared CPU and memory requests and limits. It assigns each pod to one of three classes (Guaranteed, Burstable, or BestEffort); the requests behind the class drive the scheduler's placement decisions, and the class shapes the kubelet's eviction behavior when a node is under resource pressure. For example, a pod in which every container has CPU and memory limits equal to its requests receives the Guaranteed QoS class, which gives it the highest protection from eviction. This mechanism helps critical agents maintain performance while allowing less critical agents to use spare capacity.
Related Terms
Agent Quality of Service (QoS) interacts with several core orchestration concepts that govern resource allocation, scheduling, and system resilience. These related terms define the operational parameters and guarantees within a multi-agent system.
Agent Resource Quota
An Agent Resource Quota is a policy constraint that limits the aggregate amount of compute resources (CPU, memory) or object counts (pods, services) that a collection of agents within a namespace can consume. It works in tandem with QoS classes to enforce cluster-wide resource governance.
- Enforces Budgets: Prevents a single team or application from monopolizing cluster resources.
- Multi-Dimensional Limits: Can cap CPU, memory, storage, and the number of Kubernetes objects.
- Namespace Scoped: Applied per namespace, allowing different quotas for development, staging, and production environments.
Quotas are evaluated when a pod is created. If an agent's resource requests would push the namespace past its quota, the API server rejects the pod at admission, before it ever reaches the scheduler.
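A minimal sketch of how such quotas might be declared follows. The first object caps aggregate requests and limits in one namespace; the second uses the BestEffort scope to cap only the number of BestEffort pods in another. All names, namespaces, and amounts are hypothetical.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agents-compute        # hypothetical name
  namespace: agents-prod      # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agents-besteffort     # hypothetical name
  namespace: agents-dev       # hypothetical namespace
spec:
  hard:
    pods: "5"                 # the BestEffort scope can only track pod count
  scopes:
    - BestEffort
```

Note that once a namespace has a quota on CPU or memory, pods that omit the corresponding requests or limits are generally rejected unless defaults are supplied, which is why the BestEffort-scoped quota is shown in a separate namespace.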
Agent HorizontalPodAutoscaler (HPA)
The Agent HorizontalPodAutoscaler (HPA) is a Kubernetes controller that automatically scales the number of agent pod replicas in a Deployment or StatefulSet based on observed metrics. It interacts with QoS because resource utilization targets are computed relative to the agent's declared requests.
- Metric-Driven: Scales based on average CPU/memory utilization or custom application metrics (e.g., queue length).
- Target Utilization: Aims for a target percentage of the resource request. A BestEffort pod with no request cannot be scaled on a CPU or memory utilization target, because there is no request to measure against.
- QoS Implications: Scaling Guaranteed pods is highly predictable, while scaling Burstable pods requires careful tuning of targets relative to their variable resource usage.
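The following is a minimal sketch of an HPA that scales on CPU utilization measured against the pods' requests. The Deployment name, replica bounds, and 70% target are assumptions chosen for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: planner-agent           # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: planner-agent         # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization     # percentage of the CPU request
          averageUtilization: 70
```

With a 500m CPU request, a 70% target corresponds to an average of roughly 350m per pod, so the request value effectively sets the scaling threshold.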
Pod Disruption Budget (PDB)
A Pod Disruption Budget (PDB) is a Kubernetes policy that limits how many agent pods can be down simultaneously because of voluntary disruptions, preserving availability during maintenance. It defines the minimum number or percentage of pods that must remain available, or the maximum that may be unavailable.
- Voluntary Disruptions: Includes actions initiated by cluster administrators, such as node draining for upgrades.
- QoS Interaction: PDBs are honored by evictions that go through the Eviction API, such as node drains. Node-pressure eviction by the kubelet does not honor PDBs, so a PDB does not change the QoS-based eviction order when a node runs short of memory.
- Key Parameters: minAvailable (e.g., "90%") or maxUnavailable (e.g., "1").
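A minimal sketch of a PDB using minAvailable is shown below. The object name and the app label are hypothetical and would need to match the labels on the agent's pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: planner-agent-pdb       # hypothetical name
spec:
  minAvailable: "90%"           # alternatively, set maxUnavailable (not both)
  selector:
    matchLabels:
      app: planner-agent        # hypothetical label on the agent pods
```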
Agent Scheduling
Agent Scheduling is the process by which an orchestration system decides which compute node should run a specific agent instance. The scheduler uses the agent's resource requests, the same values from which its QoS class is derived, as a primary input for this decision.
- Bin Packing: The scheduler attempts to place pods on nodes with sufficient allocatable resources to meet the pod's requests.
- Priority for Guaranteed: Pods with Guaranteed QoS are often easier to schedule predictably because their resource usage is strictly bounded.
- Node Pressure: If a node experiences resource pressure (e.g., memory exhaustion), the kubelet will evict pods, starting with BestEffort, then Burstable pods exceeding their requests, before touching Guaranteed pods.
Agent Self-Healing
Agent Self-Healing is an orchestration capability where the system automatically detects agent failures and takes corrective action, such as restarting the agent or rescheduling it. QoS influences the system's response to resource-related failures.
- Health Probes: Uses liveness and readiness probes to determine agent health.
- Eviction vs. Termination: A pod evicted due to memory pressure (a QoS-driven action) is different from a termination due to a failed liveness probe. The former is a resource management event, while the latter is a health failure.
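- Restart Policy: The pod's restartPolicy controls whether the kubelet restarts failed containers on the same node. An evicted pod is not restarted in place; a controller such as a Deployment creates a replacement that may land on a different node, which matters for agents with persistent state.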
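The health-check side of self-healing can be sketched as a Pod spec with liveness and readiness probes. The endpoint paths, port, timings, names, and image here are all hypothetical and assume the agent exposes HTTP health endpoints.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: planner-agent                 # hypothetical name
spec:
  restartPolicy: Always               # kubelet restarts failed containers in place
  containers:
    - name: agent
      image: example.com/agent:1.0    # placeholder image
      livenessProbe:                  # failure triggers a container restart
        httpGet:
          path: /healthz              # assumed health endpoint
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 15
      readinessProbe:                 # failure removes the pod from Service endpoints
        httpGet:
          path: /ready                # assumed readiness endpoint
          port: 8080
        periodSeconds: 5
```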
Agent Declarative Configuration
Agent Declarative Configuration is a practice where the desired state of an agent system is declared in version-controlled files (like YAML), and an orchestration tool ensures the actual state matches this specification. QoS is a property derived from this declared configuration.
- Source of Truth: Resource requests and limits are defined in the pod spec, from which the QoS class is automatically calculated.
- GitOps: This declarative spec is often managed via GitOps workflows, where changes to resource definitions in a Git repository trigger automated rollouts.
- Immutable Intent: The orchestrator continuously reconciles the cluster state to match these declarations, including enforcing the resource guarantees and limits that define QoS.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.