Agent Quality of Service (QoS)

Agent Quality of Service (QoS) is a classification (Guaranteed, Burstable, BestEffort) assigned by an orchestrator like Kubernetes based on resource requests and limits, influencing scheduling priority and eviction order under resource pressure.

AGENT LIFECYCLE MANAGEMENT

What is Agent Quality of Service (QoS)?

Agent Quality of Service (QoS) is a classification system used by orchestration platforms to manage the scheduling priority and resource guarantees for autonomous agents, ensuring predictable performance and system stability.

Agent Quality of Service (QoS) is a classification (Guaranteed, Burstable, or BestEffort) assigned by an orchestrator like Kubernetes based on an agent's declared resource requests and limits. The requests determine where the agent can be scheduled, and the resulting class determines its order of eviction when the node experiences resource pressure. This gives platform engineers a mechanism for ensuring that high-priority agents keep the compute resources they need to function reliably.

In practice, the Guaranteed QoS class is assigned to agents whose CPU and memory requests equal their limits in every container, so the resources they request are reserved for them; they are still capped at those limits. Burstable agents typically have requests set lower than limits, allowing them to use excess resources when available. BestEffort agents, with no requests or limits, have no reserved resources and are the first to be terminated under memory pressure. This makes QoS a fundamental tool for platform engineers managing agent lifecycle and cluster stability.

KUBERNETES RESOURCE MODEL

The Three QoS Classes

In Kubernetes, a Pod's Quality of Service (QoS) class is automatically assigned based on its resource requests and limits. The requests drive scheduling decisions, and the class itself drives the order of eviction when node resources are exhausted.

Guaranteed QoS

The highest-priority class, assigned to Pods where every container defines a CPU request, a CPU limit, a memory request, and a memory limit, with the request equal to the limit for each resource.

Scheduling: The scheduler uses the request as the minimum resource guarantee.
Eviction: These Pods are the last to be killed under memory pressure.
Use Case: Critical, latency-sensitive agents where predictable performance is essential.
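
The "every container" condition is easy to miss in multi-container Pods. The manifest below is a minimal sketch of a Guaranteed Pod with a sidecar; the Pod name, container names, and images are hypothetical placeholders, not part of any real system.

```yaml
# Sketch only: names and images are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: planner-agent
spec:
  containers:
    - name: agent
      image: example.com/agents/planner:1.0
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"      # equal to the request
          memory: "256Mi"  # equal to the request
    - name: log-shipper
      image: example.com/tools/log-shipper:1.0
      resources:
        requests:
          cpu: "100m"
          memory: "64Mi"
        limits:
          cpu: "100m"      # the sidecar must match too
          memory: "64Mi"
```

Removing the resources block from either container, or raising one limit above its request, would move the whole Pod to the Burstable class.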

Burstable QoS

The most common class in practice, assigned if a Pod does not meet the criteria for Guaranteed QoS but has at least one container with a CPU or memory request or limit set.

Scheduling: Uses the defined request for scheduling.
Eviction: Killed after all BestEffort Pods but before any Guaranteed Pods when the node is under memory pressure.
Use Case: General-purpose agents that need a baseline of resources but can temporarily use more (burst) if available.

BestEffort QoS

The lowest priority class, assigned to Pods where no container has any CPU or memory requests or limits specified.

Scheduling: Has no resource guarantees; scheduled onto nodes with allocatable space.
Eviction: These Pods are the first to be terminated when the node experiences memory pressure.
Use Case: Non-critical, batch-oriented, or test agents where interruption is acceptable.

How QoS Affects Scheduling

The scheduler uses the resource request (not the limit) to determine if a node has enough capacity to run a Pod.

A Guaranteed Pod with a 1 CPU request requires a node with at least 1 allocatable CPU.
A BestEffort Pod, with no request, can be scheduled anywhere but competes for resources without protection.
Nodes use a Quality of Service (QoS) cgroup hierarchy to enforce these classes. Under CPU contention, CPU time is divided in proportion to CPU requests, so BestEffort Pods receive the smallest share.

Eviction Order Under Memory Pressure

When a node runs low on memory, the kubelet triggers pod eviction to reclaim resources. The order generally follows the QoS classes:

BestEffort Pods are killed first.
Burstable Pods whose usage exceeds their request are killed next, starting with those consuming the most memory relative to their request.
Guaranteed Pods, along with Burstable Pods that stay within their requests, are killed last, typically only when system daemons need more memory than was reserved for them and no lower-ranked Pods remain.

This ensures critical agent workloads have the highest survivability.
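
The point at which the kubelet starts evicting is configurable per node. As a rough sketch, a kubelet configuration file can set hard and soft eviction thresholds on available memory. The threshold values below are illustrative assumptions, not recommendations.

```yaml
# Sketch of a kubelet configuration file; values are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"   # evict immediately below this
evictionSoft:
  memory.available: "500Mi"   # evict if below this for the grace period
evictionSoftGracePeriod:
  memory.available: "1m30s"
```

A soft threshold gives Pods time to recover or shut down cleanly, while the hard threshold acts as a last line of defense before the kernel's own out-of-memory killer takes over.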

Practical Configuration Examples

Guaranteed QoS Pod Spec:

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```

Burstable QoS Pod Spec:

```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # Limit > Request
    cpu: "1"
```

BestEffort QoS Pod Spec:

```yaml
# No resources section defined
```
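
Once a Pod is created, Kubernetes records the computed class on the Pod object in the status.qosClass field, which is a convenient way to confirm that a spec produced the class you intended. The excerpt below is an illustrative fragment of what a Pod using the Guaranteed spec above would report; the Pod name is a hypothetical placeholder.

```yaml
# Excerpt of the live Pod object (other fields omitted).
apiVersion: v1
kind: Pod
metadata:
  name: planner-agent   # hypothetical name
status:
  qosClass: Guaranteed  # set by Kubernetes, never by the author of the spec
```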

AGENT LIFECYCLE MANAGEMENT

How Agent QoS Works in Orchestration

Agent Quality of Service (QoS) is a classification system used by orchestrators to manage resource allocation and scheduling priority for agent workloads, directly impacting system stability and performance under contention.

Agent Quality of Service (QoS) is a resource management classification (Guaranteed, Burstable, or BestEffort) assigned by an orchestrator like Kubernetes based on an agent's declared CPU and memory requests and limits. This classification determines the order in which the agent may be evicted from a node when the system experiences resource pressure, helping critical agents stay in operation. Placement is decided from the declared requests, while scheduler preemption is governed by Pod priority (PriorityClass) rather than by QoS class.

In practice, an agent with Guaranteed QoS (equal requests and limits) is the most protected and is last to be evicted, making it suitable for mission-critical, latency-sensitive tasks. Agents with Burstable (limits higher than requests) or BestEffort (no requests/limits) classifications reserve less or nothing up front, use spare capacity opportunistically, and are terminated first during resource contention, which suits batch or background processing. This tiered system allows platform engineers to architect cost-effective, resilient multi-agent systems by aligning an agent's business importance with its resource guarantees.

AGENT QUALITY OF SERVICE (QOS)

Frequently Asked Questions

Agent Quality of Service (QoS) is a critical concept in multi-agent orchestration that determines resource allocation, scheduling priority, and system stability. These FAQs address its mechanisms, implementation, and impact on agent lifecycle management.

Agent Quality of Service (QoS) is a classification system used by an orchestrator, such as Kubernetes, to manage the priority and resource guarantees of agent pods based on their declared CPU and memory requests and limits. It works by assigning pods to one of three classes (Guaranteed, Burstable, or BestEffort). The requests drive the scheduler's placement decisions, and the class drives the kubelet's eviction behavior when a node is under resource pressure. For example, a pod in which every container has CPU and memory limits equal to its requests receives the Guaranteed QoS class, granting it the highest protection from eviction. This mechanism ensures that critical agents maintain performance while allowing less critical agents to utilize spare capacity.

AGENT LIFECYCLE MANAGEMENT

Related Terms

Agent Quality of Service (QoS) interacts with several core orchestration concepts that govern resource allocation, scheduling, and system resilience. These related terms define the operational parameters and guarantees within a multi-agent system.

Agent Resource Quota

An Agent Resource Quota is a policy constraint that limits the aggregate amount of compute resources (CPU, memory) or object counts (pods, services) that a collection of agents within a namespace can consume. It works in tandem with QoS classes to enforce cluster-wide resource governance.

Enforces Budgets: Prevents a single team or application from monopolizing cluster resources.
Multi-Dimensional Limits: Can cap CPU, memory, storage, and the number of Kubernetes objects.
Namespace Scoped: Applied per namespace, allowing different quotas for development, staging, and production environments.

Quotas are evaluated during pod creation. If an agent's resource requests would exceed the namespace quota, the API server rejects the pod at admission, before it ever reaches the scheduler.
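
A minimal sketch of such a quota follows. The namespace, object name, and all numeric caps are hypothetical and chosen only for illustration.

```yaml
# Sketch only: namespace, name, and caps are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agents-compute
  namespace: agents-prod
spec:
  hard:
    requests.cpu: "20"        # sum of CPU requests across the namespace
    requests.memory: 64Gi
    limits.cpu: "40"          # sum of CPU limits across the namespace
    limits.memory: 128Gi
    pods: "50"                # object-count cap
```

One side effect worth knowing: when a quota covers CPU or memory requests and limits, Pods in that namespace must declare those values to be admitted, so a quota like this effectively rules out BestEffort agents there.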

Agent HorizontalPodAutoscaler (HPA)

The Agent HorizontalPodAutoscaler (HPA) is a Kubernetes controller that automatically scales the number of agent pod replicas in a Deployment or StatefulSet based on observed metrics. It interacts with QoS because resource utilization targets are computed relative to the agent's declared requests.

Metric-Driven: Scales based on average CPU/memory utilization or custom application metrics (e.g., queue length).
Target Utilization: Aims for a target percentage of the resource request. A BestEffort pod with no request cannot be scaled on a CPU/memory utilization target, because there is no request to measure against.
QoS Implications: Scaling Guaranteed pods is highly predictable, while scaling Burstable pods requires careful tuning of targets relative to their variable resource usage.
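
As an illustration, the sketch below scales a hypothetical planner-agent Deployment on average CPU utilization. The names, replica bounds, and the 70% target are assumptions made for the example.

```yaml
# Sketch only: target name, replica bounds, and threshold are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: planner-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: planner-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of each pod's CPU request
```

With a CPU request of 500m per pod, this target corresponds to an average of roughly 350m of actual usage per pod before the autoscaler adds replicas.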

Pod Disruption Budget (PDB)

A Pod Disruption Budget (PDB) is a Kubernetes policy that limits how many agent pods can be down simultaneously during a voluntary disruption, ensuring high availability during maintenance. It defines the minimum number or percentage of pods that must remain available, or the maximum that may be unavailable.

Voluntary Disruptions: Includes actions initiated by cluster administrators, such as node draining for upgrades.
QoS Interaction: PDBs are honored by API-initiated evictions such as node drains, but not by kubelet evictions under node resource pressure. A PDB therefore does not protect a BestEffort pod from being evicted first when a node runs low on memory; the QoS-based eviction order still applies.
Key Parameters: minAvailable (e.g., "90%") or maxUnavailable (e.g., "1").
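
A minimal sketch of a PDB using the minAvailable parameter follows; the object name and the app label are hypothetical and would need to match the labels on your own agent pods.

```yaml
# Sketch only: name and label selector are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: planner-agent-pdb
spec:
  minAvailable: "90%"        # alternatively, set maxUnavailable instead
  selector:
    matchLabels:
      app: planner-agent
```

Only one of minAvailable and maxUnavailable can be set in a single PDB.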

Agent Scheduling

Agent Scheduling is the process by which an orchestration system decides which compute node should run a specific agent instance. The scheduler uses the agent's resource requests, the same values from which its QoS class is derived, as a primary input for this decision.

Bin Packing: The scheduler attempts to place pods on nodes with sufficient allocatable resources to meet the pod's requests.
Priority for Guaranteed: Pods with Guaranteed QoS are often easier to schedule predictably because their resource usage is strictly bounded.
Node Pressure: If a node experiences resource pressure (e.g., memory exhaustion), the kubelet will evict pods, starting with BestEffort, then Burstable pods exceeding their requests, before touching Guaranteed pods.

Agent Self-Healing

Agent Self-Healing is an orchestration capability where the system automatically detects agent failures and takes corrective action, such as restarting the agent or rescheduling it. QoS influences the system's response to resource-related failures.

Health Probes: Uses liveness and readiness probes to determine agent health.
Eviction vs. Termination: A pod evicted due to memory pressure (a QoS-driven action) is different from a termination due to a failed liveness probe. The former is a resource management event, while the latter is a health failure.
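Restart Policy: Controls whether the kubelet restarts failed containers in place on the same node. An evicted pod is not restarted; its controller (for example, a Deployment) creates a replacement that may land on a different node, which matters for agents with persistent state.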
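
The health-probe side of self-healing can be sketched as follows. The image, port, and the /healthz and /ready paths are hypothetical; they assume the agent exposes HTTP health endpoints, which the source does not specify.

```yaml
# Sketch only: image, port, and probe paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: planner-agent
spec:
  restartPolicy: Always
  containers:
    - name: agent
      image: example.com/agents/planner:1.0
      livenessProbe:           # failure restarts the container
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:          # failure removes the pod from Service endpoints
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```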

Agent Declarative Configuration

Agent Declarative Configuration is a practice where the desired state of an agent system is declared in version-controlled files (like YAML), and an orchestration tool ensures the actual state matches this specification. QoS is a property derived from this declared configuration.

Source of Truth: Resource requests and limits are defined in the pod spec, from which the QoS class is automatically calculated.
GitOps: This declarative spec is often managed via GitOps workflows, where changes to resource definitions in a Git repository trigger automated rollouts.
Immutable Intent: The orchestrator continuously reconciles the cluster state to match these declarations, including enforcing the resource guarantees and limits that define QoS.
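
To make this concrete, here is a sketch of a version-controlled Deployment whose declared resources would yield the Burstable class for every replica. The names, image, and replica count are hypothetical.

```yaml
# Sketch only: names, image, and replica count are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: planner-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: planner-agent
  template:
    metadata:
      labels:
        app: planner-agent
    spec:
      containers:
        - name: agent
          image: example.com/agents/planner:1.0
          resources:
            requests:          # used for scheduling
              cpu: "250m"
              memory: "128Mi"
            limits:            # higher than requests, so the class is Burstable
              cpu: "1"
              memory: "512Mi"
```

Changing the limits to match the requests in a later commit would roll out new pods in the Guaranteed class, with no separate QoS setting to maintain.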

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
