The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically scales the number of pods in a deployment or replica set based on observed metrics like CPU utilization or custom application metrics. It operates by querying the Kubernetes Metrics Server for resource usage or a custom metrics API for application-level data. The HPA controller continuously compares the current metric values against the target thresholds defined in its specification, calculating the desired replica count to maintain performance and efficiency. This enables reactive scaling to handle increases in load and scale down during periods of low demand, optimizing resource consumption and cost.
What is Horizontal Pod Autoscaler (HPA)?
The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set based on observed resource utilization or custom application metrics.
For agent deployment observability, HPA is crucial for managing the computational footprint of autonomous agents. By scaling based on metrics like request latency or queue depth, it ensures agentic services remain responsive under variable workloads. Integrating HPA with a comprehensive telemetry pipeline allows for scaling decisions informed by business logic, not just infrastructure metrics. This is a key component of autoscaling strategies within Kubernetes, working alongside the Vertical Pod Autoscaler (VPA) and Cluster Autoscaler to provide a full-stack, elastic infrastructure for dynamic AI workloads.
Key Features of HPA
The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set based on observed utilization metrics, ensuring applications can scale to meet demand and reduce costs during low traffic.
Metric-Driven Scaling
HPA scales workloads based on observed metrics, not schedules or predictions. It continuously queries the Kubernetes Metrics API to collect data points (e.g., average CPU utilization across all pods) and compares them against target values you define.
- Default Metric: CPU utilization, expressed as a percentage of the pods' CPU requests.
- Custom & External Metrics: Beyond the CPU and memory resource metrics, HPA can scale on application-specific metrics (e.g., requests per second from a Prometheus adapter) or on cloud service metrics.
- Target Value: You set a target, such as 70% CPU utilization. HPA's algorithm calculates the desired replica count to meet that target.
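To make the "percentage of requests" idea concrete, here is a minimal sketch of how an average CPU utilization figure could be derived from per-pod usage and requests. The function name and the sample numbers are illustrative assumptions, and the real controller applies additional handling for missing metrics and pods that are not ready.

```python
def average_cpu_utilization(pods):
    """Return CPU utilization as a percentage of requested CPU.

    `pods` is a list of (usage_millicores, request_millicores) tuples.
    This is a simplified model, not the controller's actual code.
    """
    total_usage = sum(usage for usage, _ in pods)
    total_request = sum(request for _, request in pods)
    if total_request == 0:
        # Without CPU requests there is no denominator for utilization.
        raise ValueError("CPU requests must be set to compute utilization")
    return 100.0 * total_usage / total_request


# Hypothetical sample: three pods, each requesting 500m of CPU.
sample_pods = [(350, 500), (400, 500), (300, 500)]
utilization = average_cpu_utilization(sample_pods)
print(f"{utilization:.1f}% of requested CPU")
```

With a 70% target, this sample sits exactly on target, so no scaling would be needed.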
The Scaling Algorithm
HPA uses a deterministic algorithm to calculate the desired number of replicas.
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
- The calculation is performed per metric. When multiple metrics are specified, HPA calculates a replica count for each and chooses the highest value.
- It incorporates stabilization windows to prevent rapid, flapping scale events. The --horizontal-pod-autoscaler-downscale-stabilization flag (default 5 minutes) dictates how long HPA must observe low demand before scaling down.
- Tolerance: A configurable tolerance (default 0.1) prevents trivial rescaling. If the ratio of the current metric value to the desired metric value is within 10% of 1.0, HPA skips the scaling action.
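The formula, the tolerance check, and the "highest value wins" rule for multiple metrics can be sketched in a few lines. The function names below are assumptions made for illustration; the sketch leaves out stabilization windows, readiness handling, and the minReplicas/maxReplicas clamp.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply desiredReplicas = ceil(currentReplicas * current / target)."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_for_all(current_replicas, metrics, tolerance=0.1):
    """With several metrics, the largest proposed replica count is used.

    `metrics` is a list of (current_value, target_value) pairs.
    """
    return max(
        desired_replicas(current_replicas, current, target, tolerance)
        for current, target in metrics
    )


print(desired_replicas(4, 90, 70))   # ratio ~1.29, scale up
print(desired_replicas(4, 73, 70))   # ratio ~1.04, within tolerance
print(desired_replicas(4, 35, 70))   # ratio 0.5, scale down
print(desired_for_all(4, [(90, 70), (200, 100)]))
```

Taking the maximum across metrics is the conservative choice: the workload is sized for whichever metric is under the most pressure.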
Resource Metrics vs. Custom Metrics
HPA supports distinct metric types that require different API setups.
- Resource Metrics: Core Kubernetes resource usage (CPU, memory). These use the metrics.k8s.io API, typically provided by the Metrics Server.
- Custom Metrics: Application-level metrics (e.g., HTTP requests queue depth, messages processed per second). These use the custom.metrics.k8s.io API, provided by an adapter like the Prometheus Adapter.
- External Metrics: Metrics from outside the Kubernetes cluster (e.g., a queue length from AWS SQS or Google Pub/Sub). These use the external.metrics.k8s.io API.
Custom and external metrics enable scaling based on business logic, not just infrastructure saturation.
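As an illustration of how the three metric types differ in an HPA spec, the sketch below builds one entry of each kind as plain Python dictionaries that mirror the autoscaling/v2 metrics list, and maps each type to the API group that serves it. The metric names http_requests_per_second and sqs_queue_length are hypothetical placeholders; real names depend on what your adapter exposes.

```python
import json

# Which aggregated API serves each metric source type.
METRIC_APIS = {
    "Resource": "metrics.k8s.io",
    "Pods": "custom.metrics.k8s.io",
    "Object": "custom.metrics.k8s.io",
    "External": "external.metrics.k8s.io",
}

metrics = [
    {   # Resource metric: average CPU utilization across pods.
        "type": "Resource",
        "resource": {
            "name": "cpu",
            "target": {"type": "Utilization", "averageUtilization": 70},
        },
    },
    {   # Custom per-pod metric (hypothetical name).
        "type": "Pods",
        "pods": {
            "metric": {"name": "http_requests_per_second"},
            "target": {"type": "AverageValue", "averageValue": "100"},
        },
    },
    {   # External metric from outside the cluster (hypothetical name).
        "type": "External",
        "external": {
            "metric": {"name": "sqs_queue_length"},
            "target": {"type": "AverageValue", "averageValue": "30"},
        },
    },
]


def api_group_for(metric):
    """Return the API group HPA would query for this metric entry."""
    return METRIC_APIS[metric["type"]]


for entry in metrics:
    print(entry["type"], "->", api_group_for(entry))

# The same structure can be serialized and embedded under spec.metrics.
print(json.dumps(metrics, indent=2))
```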
Behavior Configuration (K8s 1.18+)
The behavior field in the HPA spec allows fine-grained control over scaling speed and stabilization.
- Scale-Up/Down Policies: Control the rate of scaling. Each policy sets a type (Pods for an absolute number of pods, or Percent for a percentage of current replicas), a value, and a periodSeconds window over which that change is allowed.
- Stabilization Windows: Define separate windows for scale-up and scale-down. A longer downscale stabilization window protects against removing pods too quickly after a brief traffic spike.
- Select Policies: The selectPolicy field specifies whether the policy allowing the largest change (Max) or the smallest change (Min) is used; Disabled turns off scaling in that direction.
Example: A policy allowing a max increase of 100% of current replicas every 60 seconds for rapid scale-up, but a max decrease of 10% every 5 minutes for cautious scale-down.
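The example policy can be written out as a behavior block and paired with a rough calculation of the replica bounds it permits. This is a sketch under stated assumptions: the dictionary mirrors the autoscaling/v2 behavior fields, the helper functions are hypothetical, the rounding is simplified, and the limits are computed from the current replica count rather than the count at the start of the period.

```python
import math

behavior = {
    "scaleUp": {
        "stabilizationWindowSeconds": 0,
        # Allow up to +100% of current replicas every 60 seconds.
        "policies": [{"type": "Percent", "value": 100, "periodSeconds": 60}],
    },
    "scaleDown": {
        "stabilizationWindowSeconds": 300,
        # Allow at most -10% of current replicas every 5 minutes.
        "policies": [{"type": "Percent", "value": 10, "periodSeconds": 300}],
    },
}


def max_change(policy, current_replicas):
    """Largest replica change one policy allows within its period."""
    if policy["type"] == "Pods":
        return policy["value"]
    return math.ceil(current_replicas * policy["value"] / 100)


def scale_up_limit(current_replicas):
    """Upper bound, assuming the default behavior of picking the largest change."""
    policies = behavior["scaleUp"]["policies"]
    return current_replicas + max(max_change(p, current_replicas) for p in policies)


def scale_down_limit(current_replicas):
    """Lower bound, assuming the default behavior of picking the largest change."""
    policies = behavior["scaleDown"]["policies"]
    return current_replicas - max(max_change(p, current_replicas) for p in policies)


print(scale_up_limit(10))    # at most 20 replicas after one 60 s period
print(scale_down_limit(10))  # at least 9 replicas after one 300 s period
```

The asymmetry is deliberate: adding capacity quickly protects latency, while removing it slowly avoids thrashing when load returns.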
Integration with Cluster Autoscaler
HPA and the Cluster Autoscaler (CA) work together for full-stack elasticity.
- HPA adjusts the number of pods (application layer) based on demand.
- If scaling up creates pods that cannot be scheduled due to insufficient node resources (CPU/memory), the Cluster Autoscaler detects this.
- CA automatically provisions new nodes in the cloud provider (infrastructure layer) to accommodate the pending pods.
- Conversely, when HPA scales down pods and nodes become underutilized, CA can safely drain and remove those nodes to reduce infrastructure costs.
This creates a fully automated, cost-efficient scaling pipeline from application to infrastructure.
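A back-of-the-envelope calculation shows why a pod-level scale-up can trigger a node-level one. The sketch below is a deliberately naive capacity model with hypothetical node and pod sizes; the real Cluster Autoscaler simulates scheduling and accounts for system reservations, DaemonSets, affinity rules, and node group limits.

```python
import math


def pods_per_node(node_cpu_m, node_mem_mi, pod_cpu_m, pod_mem_mi):
    """How many identical pods fit on one node by CPU and memory requests."""
    return min(node_cpu_m // pod_cpu_m, node_mem_mi // pod_mem_mi)


def nodes_needed(replicas, per_node):
    """Nodes required to place `replicas` pods at `per_node` pods each."""
    return math.ceil(replicas / per_node)


# Hypothetical sizes: 4 CPU / 16 GiB nodes, pods requesting 500m / 1 GiB.
per_node = pods_per_node(4000, 16384, 500, 1024)

before = nodes_needed(12, per_node)  # replicas before HPA scales up
after = nodes_needed(20, per_node)   # replicas after HPA scales up

print(f"{per_node} pods per node")
print(f"nodes before: {before}, nodes after: {after}")
```

In this model the extra replicas would stay pending until a third node joins, which is the situation the Cluster Autoscaler reacts to.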
Practical Considerations & Limits
Effective HPA usage requires awareness of its operational boundaries.
- Pod Readiness: HPA bases its calculation on ready pods. Pods stuck in ContainerCreating or failing readiness probes are set aside from the metric average, and the controller treats them conservatively when it recomputes the result, which dampens the size of the scale change.
- Resource Requests Must Be Set: For CPU-based scaling, your pod spec must define resources.requests.cpu. HPA cannot calculate utilization percentage without a request value as the denominator.
- Stabilization Delays: HPA acts on the highest recommendation seen during the downscale stabilization window (--horizontal-pod-autoscaler-downscale-stabilization, default 5 minutes), so scale-down lags behind a drop in load. Scale-up has no stabilization delay by default.
- Minimum and Maximum Replicas: Always set sensible minReplicas and maxReplicas to prevent runaway scaling (down to too few replicas or up to the point of exhausting cluster resources).
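Two of these considerations are easy to check before deploying. The sketch below is a hypothetical pre-flight check, not part of any Kubernetes tooling: it takes an HPA spec and a pod spec as plain dictionaries, assumes CPU utilization is the scaling metric, and reports missing CPU requests or inconsistent replica bounds.

```python
def check_hpa_preconditions(hpa_spec, pod_spec):
    """Return a list of problems that would undermine CPU-based scaling."""
    problems = []

    # minReplicas defaults to 1 when omitted; maxReplicas is required.
    min_replicas = hpa_spec.get("minReplicas", 1)
    max_replicas = hpa_spec.get("maxReplicas")
    if max_replicas is None:
        problems.append("maxReplicas is not set")
    elif min_replicas > max_replicas:
        problems.append("minReplicas is greater than maxReplicas")

    # Utilization needs a CPU request on every container as the denominator.
    for container in pod_spec.get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        if "cpu" not in requests:
            problems.append(
                f"container {container['name']!r} has no resources.requests.cpu"
            )
    return problems


# Hypothetical specs for illustration.
hpa_spec = {"minReplicas": 2, "maxReplicas": 10}
pod_spec = {
    "containers": [
        {"name": "agent", "resources": {"requests": {"cpu": "500m"}}},
        {"name": "sidecar", "resources": {}},
    ]
}

for problem in check_hpa_preconditions(hpa_spec, pod_spec):
    print("warning:", problem)
```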
HPA vs. Other Scaling Methods
This table compares the Horizontal Pod Autoscaler (HPA) with other primary scaling methods available within Kubernetes and cloud platforms, highlighting key operational characteristics for agent deployment observability.
| Scaling Dimension | Horizontal Pod Autoscaler (HPA) | Vertical Pod Autoscaler (VPA) | Cluster Autoscaler | Manual Scaling |
|---|---|---|---|---|
| Primary Scaling Axis | Number of Pod replicas | Pod resource requests/limits (CPU, Memory) | Number of cluster Nodes | Number of Pod replicas or Node resources |
| Automation Level | Fully automatic based on metrics | Fully automatic based on metrics | Fully automatic based on pending pods | Manual via kubectl or dashboard |
| Key Trigger Metrics | CPU utilization, memory utilization, custom/external metrics | Historic CPU/Memory usage recommendations | Pending pods due to insufficient resources | Human observation and decision |
| Typical Scaling Latency | 15-30 seconds (default metrics poll interval) | Requires pod restart; slower (minutes) | Node provisioning time; 1-5 minutes | Immediate upon command execution |
| Impact on Running Pods | Minimal; creates/destroys pods, may cause brief traffic shift | High; often requires pod restart/eviction | High; involves node addition/removal | Minimal for pod count, high for node changes |
| Stateful Application Support | Limited; requires careful design for stateful workloads | Possible, but pod restart is disruptive | Yes, nodes provide resources for StatefulSets | Yes, with manual planning |
| Cost Optimization Focus | Right-sizing replica count for variable load | Right-sizing individual pod resource allocation | Right-sizing node count for cluster demand | None; relies on static over-provisioning |
| Integration with Custom Metrics | Yes, via Custom Metrics API or External Metrics API | No, focuses on core resource metrics | No, reacts to Kubernetes scheduler events | No |
| Use Case in Agent Observability | Dynamic scaling of stateless inference or processing agents based on QPS/latency | Optimizing resource allocation for individual, long-running agent pods | Ensuring cluster capacity for agent deployment during peak loads | Baseline configuration or emergency intervention |
Frequently Asked Questions
Essential questions about the Kubernetes Horizontal Pod Autoscaler (HPA), a core controller for automating resource scaling in deployments and replica sets based on observed metrics.
The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It operates in a continuous control loop:
- The HPA controller queries the Kubernetes Metrics API (for core resource metrics) or a custom metrics API (for application-specific metrics) at a default interval of 15 seconds.
- It compares the current metric value (e.g., average CPU utilization across all pods) against the target value defined in the HPA specification.
- Using the formula desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)], it calculates the desired number of replicas.
- It updates the .spec.replicas field of the target workload object (e.g., Deployment), instructing the workload controller (e.g., ReplicaSet) to scale the pods up or down to meet the desired state.
The HPA is a foundational component of autoscaling strategies, enabling applications to handle variable load efficiently and cost-effectively.
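The control loop described above can be illustrated with a small simulation. This is a toy model under stated assumptions: each sample stands for one sync tick, the metric is fed in as a fixed series rather than responding to the replica count, tolerance is ignored, and the scale-down stabilization window is expressed as a number of ticks (with the default 15-second interval, a 5-minute window would be about 20 ticks). The function name and sample values are hypothetical.

```python
import math
from collections import deque


def simulate(samples, target, start_replicas, min_replicas, max_replicas, window_ticks):
    """Run the formula once per tick and return the replica count after each tick."""
    replicas = start_replicas
    recent = deque(maxlen=window_ticks)  # recommendations inside the window
    history = []
    for value in samples:
        # Step 1-3: read the metric, compare to target, compute the proposal.
        proposal = math.ceil(replicas * value / target)
        proposal = max(min_replicas, min(max_replicas, proposal))
        recent.append(proposal)
        # Stabilization: use the highest recent proposal, so scale-up is
        # immediate while scale-down waits for the window to clear.
        replicas = max(recent)
        # Step 4: in a real cluster this is where the replica count is written.
        history.append(replicas)
    return history


# Hypothetical CPU utilization readings (%), one per tick, with a 70% target.
samples = [70, 140, 70, 30, 30, 30, 30]
history = simulate(samples, target=70, start_replicas=2,
                   min_replicas=1, max_replicas=10, window_ticks=3)
print(history)
```

In this run the spike doubles the replica count on the next tick, while the drop in load only takes effect after the window no longer contains a higher recommendation.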
Related Terms
The Horizontal Pod Autoscaler (HPA) operates within a broader Kubernetes ecosystem of scaling and resource management controllers. Understanding these related concepts is essential for designing robust, self-healing deployments.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.