Glossary

Horizontal Pod Autoscaler (HPA)

A Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set based on observed CPU utilization or other custom metrics.

KUBERNETES AUTOSCALING

What is Horizontal Pod Autoscaler (HPA)?

The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically scales the number of pods in a deployment, replica set, or stateful set based on observed metrics such as CPU utilization or custom application metrics. It reads resource usage from the Kubernetes resource metrics API (typically served by the Metrics Server) and application-level data from a custom metrics API. The HPA controller continuously compares the current metric values against the targets defined in its specification and calculates the replica count needed to bring the metric back to target. This enables reactive scaling: replicas are added as load rises and removed during periods of low demand, which reduces resource consumption and cost.

For agent deployment observability, HPA is crucial for managing the computational footprint of autonomous agents. By scaling based on metrics like request latency or queue depth, it ensures agentic services remain responsive under variable workloads. Integrating HPA with a comprehensive telemetry pipeline allows for scaling decisions informed by business logic, not just infrastructure metrics. This is a key component of autoscaling strategies within Kubernetes, working alongside the Vertical Pod Autoscaler (VPA) and Cluster Autoscaler to provide a full-stack, elastic infrastructure for dynamic AI workloads.

Key Features of HPA

Metric-Driven Scaling

HPA scales workloads based on observed metrics, not schedules or predictions. It continuously queries the Kubernetes Metrics API to collect data points (e.g., average CPU utilization across all pods) and compares them against target values you define.

Default Metric: CPU utilization, expressed as a percentage of each pod's CPU request.
Custom & External Metrics: Can scale based on application-specific metrics (e.g., requests per second from a Prometheus adapter) or even cloud service metrics. Memory usage is also supported, as a resource metric alongside CPU.
Target Value: You set a target, such as 70% CPU utilization. HPA's algorithm calculates the desired replica count to meet that target.

The Scaling Algorithm

HPA uses a deterministic algorithm to calculate the desired number of replicas.

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

The calculation is performed per metric. When multiple metrics are specified, HPA calculates a replica count for each and chooses the highest value.
It incorporates stabilization windows to prevent rapid, flapping scale events. The --horizontal-pod-autoscaler-downscale-stabilization flag (default 5 minutes) dictates how long HPA must observe low demand before scaling down.
Tolerance: A tolerance band (default 0.1, set with the --horizontal-pod-autoscaler-tolerance flag) prevents trivial rescaling. If the ratio currentMetricValue / desiredMetricValue is within 10% of 1.0, HPA skips scaling for that metric.
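
As a rough illustration of the formula, the multi-metric rule, and the tolerance band, here is a minimal Python sketch. The function names are hypothetical, and it assumes the tolerance is checked against the metric ratio before any rounding; it leaves out stabilization, readiness handling, and min/max clamping.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count proposed by a single metric."""
    ratio = current_value / target_value
    # Inside the tolerance band around 1.0, keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_replicas_multi(current_replicas, metrics, tolerance=0.1):
    """With several (current, target) metric pairs, take the highest proposal."""
    return max(
        desired_replicas(current_replicas, current, target, tolerance)
        for current, target in metrics
    )


print(desired_replicas(4, 90, 70))   # 6: ceil(4 * 90 / 70)
print(desired_replicas(4, 74, 70))   # 4: ratio is about 1.06, inside the band
print(desired_replicas_multi(4, [(90, 70), (250, 100)]))  # 10: the larger proposal wins
```

Taking the highest proposal means the most pressured metric always drives the replica count, which errs on the side of capacity rather than cost.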

Resource Metrics vs. Custom Metrics

HPA supports distinct metric types that require different API setups.

Resource Metrics: Core Kubernetes resource usage (CPU, memory). These use the metrics.k8s.io API, typically provided by the Metrics Server.
Custom Metrics: Application-level metrics (e.g., HTTP requests queue depth, messages processed per second). These use the custom.metrics.k8s.io API, provided by an adapter like the Prometheus Adapter.
External Metrics: Metrics from outside the Kubernetes cluster (e.g., a queue length from AWS SQS or Google Pub/Sub). These use the external.metrics.k8s.io API.

Custom and external metrics enable scaling based on business logic, not just infrastructure saturation.
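
To make the three metric types concrete, the sketch below builds an autoscaling/v2 HPA manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. The workload name, metric names, and target values are hypothetical placeholders. Custom metrics appear here as a Pods metric (Object metrics are the other custom type).

```python
import json

# Hypothetical names throughout: "agent-worker", "http_requests_per_second",
# and "queue_messages_ready" are placeholders, not metrics your cluster exposes.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "agent-worker"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "agent-worker",
        },
        "minReplicas": 2,
        "maxReplicas": 20,
        "metrics": [
            # Resource metric, served by metrics.k8s.io.
            {"type": "Resource",
             "resource": {"name": "cpu",
                          "target": {"type": "Utilization", "averageUtilization": 70}}},
            # Custom per-pod metric, served by custom.metrics.k8s.io.
            {"type": "Pods",
             "pods": {"metric": {"name": "http_requests_per_second"},
                      "target": {"type": "AverageValue", "averageValue": "100"}}},
            # External metric, served by external.metrics.k8s.io.
            {"type": "External",
             "external": {"metric": {"name": "queue_messages_ready"},
                          "target": {"type": "AverageValue", "averageValue": "30"}}},
        ],
    },
}

print(json.dumps(hpa, indent=2))
```

The custom and external entries only resolve if an adapter actually serves those metric names in the cluster.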

Behavior Configuration (K8s 1.18+)

The behavior field in the HPA spec allows fine-grained control over scaling speed and stabilization.

Scale-Up/Down Policies: Control the rate of scaling. Each policy has a type (Pods for an absolute number of pods, Percent for a percentage of the current replicas), a value, and a periodSeconds window over which that much change is allowed.
Stabilization Windows: Define separate windows for scale-up and scale-down. A longer downscale stabilization window protects against removing pods too quickly after a brief traffic spike.
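Select Policies: The selectPolicy field specifies whether the policy allowing the largest change (Max, the default) or the smallest change (Min) is used. Setting it to Disabled turns off scaling in that direction.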

Example: A policy allowing a max increase of 100% of current replicas every 60 seconds for rapid scale-up, but a max decrease of 10% every 5 minutes for cautious scale-down.
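
The effect of such policies can be approximated with a small Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, it assumes no scaling has already happened inside the current period, and the rounding (up for scale-up, down for scale-down) is an assumption rather than a documented guarantee.

```python
def scale_up_limit(current, policies, select_policy="Max"):
    """Highest replica count the scale-up policies allow within one period."""
    if select_policy == "Disabled":
        return current
    proposals = []
    for policy in policies:
        if policy["type"] == "Pods":
            proposals.append(current + policy["value"])
        else:  # "Percent": integer ceiling of current * (1 + value / 100)
            proposals.append((current * (100 + policy["value"]) + 99) // 100)
    # Max picks the policy allowing the largest change, Min the smallest.
    return max(proposals) if select_policy == "Max" else min(proposals)


def scale_down_limit(current, policies, select_policy="Max"):
    """Lowest replica count the scale-down policies allow within one period."""
    if select_policy == "Disabled":
        return current
    proposals = []
    for policy in policies:
        if policy["type"] == "Pods":
            proposals.append(current - policy["value"])
        else:  # "Percent": integer floor of current * (1 - value / 100)
            proposals.append(current * (100 - policy["value"]) // 100)
    # For scale-down, the largest change is the smallest resulting count.
    return min(proposals) if select_policy == "Max" else max(proposals)


rapid_up = [{"type": "Percent", "value": 100, "periodSeconds": 60}]
cautious_down = [{"type": "Percent", "value": 10, "periodSeconds": 300}]
print(scale_up_limit(4, rapid_up))         # 8: may double within 60 seconds
print(scale_down_limit(10, cautious_down)) # 9: may drop 10% within 5 minutes
```

With these limits, a workload at 4 replicas can reach 8 within a minute, while a workload at 10 replicas sheds at most one pod per five-minute period.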

Integration with Cluster Autoscaler

HPA and the Cluster Autoscaler (CA) work together for full-stack elasticity.

HPA adjusts the number of pods (application layer) based on demand.
If scaling up creates pods that cannot be scheduled due to insufficient node resources (CPU/memory), the Cluster Autoscaler detects this.
CA automatically provisions new nodes in the cloud provider (infrastructure layer) to accommodate the pending pods.
Conversely, when HPA scales down pods and nodes become underutilized, CA can safely drain and remove those nodes to reduce infrastructure costs.

This creates a fully automated, cost-efficient scaling pipeline from application to infrastructure.
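
A back-of-the-envelope Python sketch of this hand-off follows. It is deliberately simplified and not how the Cluster Autoscaler is implemented: it considers CPU requests only, assumes identical nodes with no other workloads, and ignores memory, DaemonSets, and scheduling constraints. The function name and figures are hypothetical.

```python
import math


def extra_nodes_needed(desired_pods, pod_cpu_request_m, node_allocatable_cpu_m, current_nodes):
    """Estimate how many nodes must be added so that no pod stays pending."""
    pods_per_node = node_allocatable_cpu_m // pod_cpu_request_m
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on a node")
    schedulable = pods_per_node * current_nodes
    # Pods the scheduler cannot place on the existing nodes.
    pending = max(0, desired_pods - schedulable)
    return math.ceil(pending / pods_per_node)


# HPA asks for 20 pods of 500m CPU; each of 2 nodes has 3500m allocatable.
print(extra_nodes_needed(20, 500, 3500, 2))  # 1: six pending pods fit on one new node
```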

Practical Considerations & Limits

Effective HPA usage requires awareness of its operational boundaries.

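Pod Readiness: HPA sets aside pods that are not yet ready or are missing metrics when it averages a metric. When scaling up, it conservatively assumes those not-yet-ready pods consume 0% of the target, which dampens the size of the scale-up. Slow-starting pods or failing readiness probes can therefore distort scaling decisions.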
Resource Requests Must Be Set: For CPU-based scaling, your pod spec must define resources.requests.cpu. HPA cannot calculate utilization percentage without a request value as the denominator.
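Cool-Down Delays: The --horizontal-pod-autoscaler-upscale-delay flag named in older guides is no longer available. Current Kubernetes releases use stabilization windows instead: scale-down waits 5 minutes by default (--horizontal-pod-autoscaler-downscale-stabilization), scale-up has no delay by default, and both can be overridden per HPA through the behavior field.
Minimum and Maximum Replicas: Always set sensible minReplicas and maxReplicas to prevent runaway scaling (down to too few pods to serve traffic, or up until cluster resources are exhausted).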
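
The role of the CPU request as the denominator can be shown with a short Python sketch. The function name and sample figures are hypothetical, and it assumes utilization is computed as total usage over total requests across the pods being averaged.

```python
def cpu_utilization_percent(pods):
    """pods: list of (usage_millicores, request_millicores) tuples."""
    total_usage = 0
    total_request = 0
    for usage, request in pods:
        if not request:
            # Without a request there is nothing to divide by.
            raise ValueError("pod has no CPU request; utilization is undefined")
        total_usage += usage
        total_request += request
    return total_usage * 100 // total_request


# Three pods, each requesting 500m, using 350m, 450m, and 400m.
print(cpu_utilization_percent([(350, 500), (450, 500), (400, 500)]))  # 80
```

At 80% against a 70% target, the ratio is about 1.14, which is outside the default tolerance band and would trigger a scale-up.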

COMPARISON

HPA vs. Other Scaling Methods

This table compares the Horizontal Pod Autoscaler (HPA) with other primary scaling methods available within Kubernetes and cloud platforms, highlighting key operational characteristics for agent deployment observability.

Scaling Dimension	Horizontal Pod Autoscaler (HPA)	Vertical Pod Autoscaler (VPA)	Cluster Autoscaler	Manual Scaling
Primary Scaling Axis	Number of Pod replicas	Pod resource requests/limits (CPU, Memory)	Number of cluster Nodes	Number of Pod replicas or Node resources
Automation Level	Fully automatic based on metrics	Fully automatic based on metrics	Fully automatic based on pending pods	Manual via kubectl or dashboard
Key Trigger Metrics	CPU utilization, memory utilization, custom/external metrics	Historic CPU/Memory usage recommendations	Pending pods due to insufficient resources	Human observation and decision
Typical Scaling Latency	15-30 seconds (default metrics poll interval)	Requires pod restart; slower (minutes)	Node provisioning time; 1-5 minutes	Immediate upon command execution
Impact on Running Pods	Minimal; creates/destroys pods, may cause brief traffic shift	High; often requires pod restart/eviction	High; involves node addition/removal	Minimal for pod count, high for node changes
Stateful Application Support	Limited; requires careful design for stateful workloads	Possible, but pod restart is disruptive	Yes, nodes provide resources for StatefulSets	Yes, with manual planning
Cost Optimization Focus	Right-sizing replica count for variable load	Right-sizing individual pod resource allocation	Right-sizing node count for cluster demand	None; relies on static over-provisioning
Integration with Custom Metrics	Yes, via Custom Metrics API or External Metrics API	No, focuses on core resource metrics	No, reacts to Kubernetes scheduler events	No
Use Case in Agent Observability	Dynamic scaling of stateless inference or processing agents based on QPS/latency	Optimizing resource allocation for individual, long-running agent pods	Ensuring cluster capacity for agent deployment during peak loads	Baseline configuration or emergency intervention

HORIZONTAL POD AUTOSCALER

Frequently Asked Questions

Essential questions about the Kubernetes Horizontal Pod Autoscaler (HPA), a core controller for automating resource scaling in deployments and replica sets based on observed metrics.

The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It operates in a continuous control loop:

The HPA controller queries the Kubernetes Metrics API (for core resource metrics) or a custom metrics API (for application-specific metrics) at a default interval of 15 seconds.
It compares the current metric value (e.g., average CPU utilization across all pods) against the target value defined in the HPA specification.
Using the formula desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)], it calculates the desired number of replicas.
It updates the .spec.replicas field of the target workload object (e.g., Deployment), instructing the workload controller (e.g., ReplicaSet) to scale the pods up or down to meet the desired state.

The HPA is a foundational component of autoscaling strategies, enabling applications to handle variable load efficiently and cost-effectively.
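
The control loop described above can be approximated in a few lines of Python. This is a toy simulation, not the controller's code: the function name is hypothetical, load is modelled as a cluster-wide total divided evenly across pods, and the stabilization window is three ticks for readability (at the default 15-second interval and 5-minute window it would be 20).

```python
import math
from collections import deque


def simulate_hpa(total_loads, target_per_pod, start_replicas,
                 min_replicas, max_replicas, window_ticks=3, tolerance=0.1):
    """Run one reconcile step per load sample and return the replica history."""
    replicas = start_replicas
    recent = deque(maxlen=window_ticks)  # recommendations inside the window
    history = []
    for load in total_loads:
        # Steps 1-2: read the per-pod metric and compare it with the target.
        ratio = (load / replicas) / target_per_pod
        # Step 3: apply the formula unless the ratio is inside the tolerance band.
        if abs(ratio - 1.0) <= tolerance:
            proposal = replicas
        else:
            proposal = math.ceil(replicas * ratio)
        proposal = max(min_replicas, min(max_replicas, proposal))
        # Stabilization: act on the highest recommendation in the window, so
        # scale-up is immediate and scale-down waits for the window to pass.
        recent.append(proposal)
        # Step 4: write the new replica count to the workload.
        replicas = max(recent)
        history.append(replicas)
    return history


loads = [200, 480, 480, 120, 120, 120, 120]  # e.g. total requests per second
print(simulate_hpa(loads, target_per_pod=100, start_replicas=2,
                   min_replicas=1, max_replicas=10))
# [2, 5, 5, 5, 5, 2, 2]
```

The spike is absorbed on the very next tick, while the drop in load only takes effect once every recommendation in the window agrees that fewer pods are enough.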

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
