Agent Quality of Service (QoS)

Agent Quality of Service (QoS) is a classification (Guaranteed, Burstable, BestEffort) assigned by an orchestrator like Kubernetes based on resource requests and limits, influencing scheduling priority and eviction order under resource pressure.
AGENT LIFECYCLE MANAGEMENT

What is Agent Quality of Service (QoS)?

Agent Quality of Service (QoS) is a classification system used by orchestration platforms to manage the scheduling priority and resource guarantees for autonomous agents, ensuring predictable performance and system stability.

Agent Quality of Service (QoS) is a classification (Guaranteed, Burstable, or BestEffort) that an orchestrator like Kubernetes assigns based on an agent's declared resource requests and limits. The requests determine where the scheduler can place the agent, and the class determines how early the agent is evicted when a node comes under resource pressure. Together they give platform teams a mechanism for ensuring that high-priority agents receive the compute resources they need to function reliably.

In practice, the Guaranteed class is assigned to agents whose CPU and memory requests equal their limits in every container, so the resources they request are fully reserved for them (CPU usage is still capped at the limit). Burstable agents typically set requests lower than limits, which lets them use spare capacity when it is available. BestEffort agents declare no requests or limits, receive no resource guarantees, and are the first to be terminated under memory pressure. Together, the three classes give platform engineers a basic tool for managing agent lifecycle and cluster stability.

KUBERNETES RESOURCE MODEL

The Three QoS Classes

In Kubernetes, a Pod's Quality of Service (QoS) class is assigned automatically from its resource requests and limits. The requests behind the class drive scheduling, and the class itself shapes the order of eviction when node resources are exhausted.

01

Guaranteed QoS

The highest priority class, assigned to Pods in which every container defines a CPU request and limit and a memory request and limit, with the request equal to the limit for each resource.

  • Scheduling: The scheduler uses the request as the minimum resource guarantee.
  • Eviction: These Pods are the last to be killed under memory pressure.
  • Use Case: Critical, latency-sensitive agents where predictable performance is essential.
02

Burstable QoS

The class most Pods fall into in practice. It is assigned when a Pod does not meet the criteria for Guaranteed QoS but at least one of its containers sets a CPU or memory request or limit.

  • Scheduling: Uses the defined request for scheduling.
  • Eviction: Killed after all BestEffort Pods but before any Guaranteed Pods when the node is under memory pressure.
  • Use Case: General-purpose agents that need a baseline of resources but can temporarily use more (burst) if available.
03

BestEffort QoS

The lowest priority class, assigned to Pods where no container has any CPU or memory requests or limits specified.

  • Scheduling: Has no resource guarantees; scheduled onto nodes with allocatable space.
  • Eviction: These Pods are the first to be terminated when the node experiences memory pressure.
  • Use Case: Non-critical, batch-oriented, or test agents where interruption is acceptable.
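
The three definitions above amount to a small decision procedure. A minimal Python sketch follows, assuming each container is described by a plain dict with optional `requests` and `limits` maps. The `qos_class` name and this dict shape are illustrative, not a Kubernetes API. The sketch compares quantity strings literally (so `"500m"` and `"0.5"` would count as different), whereas Kubernetes compares parsed quantities.

```python
from typing import Dict, List

# Only CPU and memory are considered when deriving the QoS class.
QOS_RESOURCES = ("cpu", "memory")

# Hypothetical container shape: {"requests": {...}, "limits": {...}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Return 'Guaranteed', 'Burstable', or 'BestEffort' for a Pod."""
    any_set = False
    guaranteed = True
    for c in containers:
        limits = {k: v for k, v in c.get("limits", {}).items()
                  if k in QOS_RESOURCES}
        requests = {k: v for k, v in c.get("requests", {}).items()
                    if k in QOS_RESOURCES}
        # A request that is left out defaults to the limit for that resource.
        for res, val in limits.items():
            requests.setdefault(res, val)
        if requests:
            any_set = True
        # Guaranteed needs a limit for both resources, equal to the request.
        for res in QOS_RESOURCES:
            if res not in limits or requests[res] != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Note that a single container without requests or limits is enough to move an otherwise Guaranteed Pod into the Burstable class.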
04

How QoS Affects Scheduling

The scheduler uses the resource request (not the limit) to determine if a node has enough capacity to run a Pod.

  • A Guaranteed Pod with a 1 CPU request requires a node with at least 1 allocatable CPU.
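  • A BestEffort Pod, with no request, passes the resource check on any node (other constraints such as taints and affinity still apply) but competes for resources without protection.
  • Nodes use a Quality of Service (QoS) cgroup hierarchy to enforce these classes. Under CPU contention, CPU time is shared in proportion to CPU requests, so BestEffort Pods receive the smallest share.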
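
The capacity check described above can be sketched in Python as follows. The `Node` class and `fits` function are hypothetical names used for illustration, quantities are simplified to integer millicores and MiB, and the real scheduler also applies other filters (taints, affinity, and so on) that are ignored here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    allocatable_cpu_m: int   # allocatable CPU in millicores
    allocatable_mem_mi: int  # allocatable memory in MiB
    # (cpu_m, mem_mi) requests of Pods already placed on the node
    requested: List[Tuple[int, int]] = field(default_factory=list)


def fits(node: Node, cpu_request_m: int = 0, mem_request_mi: int = 0) -> bool:
    """Check whether a Pod's requests fit in the node's remaining capacity.

    Only requests are counted; limits play no part in this check.
    """
    used_cpu = sum(cpu for cpu, _ in node.requested)
    used_mem = sum(mem for _, mem in node.requested)
    return (used_cpu + cpu_request_m <= node.allocatable_cpu_m
            and used_mem + mem_request_mi <= node.allocatable_mem_mi)
```

A BestEffort Pod corresponds to calling `fits(node)` with the default zero requests, which is why it passes the resource check on any node.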
05

Eviction Order Under Memory Pressure

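When a node runs low on memory, the kubelet evicts Pods to reclaim resources. It ranks candidates by whether their memory usage exceeds their request, then by Pod priority, then by how far usage exceeds the request. In practice that yields the following order:

  1. BestEffort Pods are the most exposed: they have no request, so any memory they use counts as exceeding it.
  2. Burstable Pods using more than their request are in the same group, and within it the Pods furthest above their request go first.
  3. Guaranteed Pods, and Burstable Pods staying within their requests, are evicted last, only when reclaiming the others is not enough.

This gives critical agent workloads the highest survivability.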
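
As a rough illustration of how such an order can arise, the sketch below ranks Pods the way the Kubernetes documentation describes kubelet ranking under memory pressure: first by whether memory usage exceeds the request, then by Pod priority, then by how far usage exceeds the request. `PodUsage` and `eviction_order` are hypothetical names, and all quantities are made-up MiB values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PodUsage:
    name: str
    mem_request_mi: int  # 0 for a BestEffort Pod
    mem_usage_mi: int
    priority: int = 0    # Pod priority; lower values are evicted earlier


def eviction_order(pods: List[PodUsage]) -> List[str]:
    """Return Pod names sorted from first-evicted to last-evicted."""
    def key(p: PodUsage) -> Tuple[bool, int, int]:
        over = p.mem_usage_mi - p.mem_request_mi
        exceeds = over > 0
        # Pods exceeding their request sort first (False < True),
        # then lower priority, then the largest overage.
        return (not exceeds, p.priority, -over)
    return [p.name for p in sorted(pods, key=key)]
```

With a BestEffort Pod using 100 MiB, a Burstable Pod 72 MiB above its request, and a Guaranteed Pod within its request (all at the same priority), this returns the BestEffort Pod first and the Guaranteed Pod last.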

06

Practical Configuration Examples

Guaranteed QoS Pod Spec:

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: "500m"
```

Burstable QoS Pod Spec:

```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # Limit > Request
    cpu: "1"
```

BestEffort QoS Pod Spec:

```yaml
# No resources section defined
```

AGENT LIFECYCLE MANAGEMENT

How Agent QoS Works in Orchestration

Agent Quality of Service (QoS) is a classification system used by orchestrators to manage resource allocation and scheduling priority for agent workloads, directly impacting system stability and performance under contention.

Agent Quality of Service (QoS) is a resource management classification (typically Guaranteed, Burstable, or BestEffort) assigned by an orchestrator like Kubernetes based on an agent's declared CPU and memory requests and limits. The requests determine which nodes the agent can be scheduled onto, and the class determines how early it may be evicted from a node when the system experiences resource pressure, so that critical agents keep running. Preemption of running Pods to make room for new ones is governed separately, by Pod priority rather than by QoS class.

In practice, an agent with Guaranteed QoS (requests equal to limits) is the last to be evicted, which makes it suitable for mission-critical, latency-sensitive tasks. Agents with Burstable (limits higher than requests) or BestEffort (no requests or limits) classifications can use spare capacity opportunistically but are terminated first during resource contention, which suits batch or background processing. This tiered system lets platform engineers build cost-effective, resilient multi-agent systems by aligning each agent's resource guarantees with its business importance.
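
Separately from kubelet eviction, the node's kernel OOM killer is also steered by QoS class, through the `oom_score_adj` value the kubelet sets on each container (a higher value means the process is killed sooner). The sketch below follows the formula given in the Kubernetes documentation. The function name is illustrative, and the exact constants may differ between Kubernetes releases.

```python
def oom_score_adj(qos: str, mem_request_bytes: int,
                  node_capacity_bytes: int) -> int:
    """Approximate the oom_score_adj assigned to a container by QoS class."""
    if qos == "Guaranteed":
        return -997  # strongly protected from the OOM killer
    if qos == "BestEffort":
        return 1000  # first candidate for the OOM killer
    # Burstable: the larger the memory request relative to node capacity,
    # the lower (safer) the score, clamped to the range [2, 999].
    raw = 1000 - (1000 * mem_request_bytes) // node_capacity_bytes
    return min(max(2, raw), 999)
```

Under this formula, a Burstable agent requesting 4 GiB on a 16 GiB node gets a score of 750, which places it between the Guaranteed and BestEffort extremes.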

AGENT QUALITY OF SERVICE (QOS)

Frequently Asked Questions

Agent Quality of Service (QoS) is a critical concept in multi-agent orchestration that determines resource allocation, scheduling priority, and system stability. These FAQs address its mechanisms, implementation, and impact on agent lifecycle management.

Agent Quality of Service (QoS) is a classification system used by an orchestrator, such as Kubernetes, to manage the priority and resource guarantees of agent pods based on their declared CPU and memory requests and limits. It works by assigning pods to one of three classes—Guaranteed, Burstable, or BestEffort—which directly influences the scheduler's placement decisions and the kubelet's eviction behavior when a node is under resource pressure. For example, a pod with both CPU and memory limits equal to its requests receives the Guaranteed QoS class, granting it the highest protection from eviction. This mechanism ensures that critical agents maintain performance while allowing less critical agents to utilize spare capacity.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.