Inferensys

Glossary

Horizontal Pod Autoscaler (HPA)

A Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set based on observed CPU utilization or other custom metrics.
KUBERNETES AUTOSCALING

What is Horizontal Pod Autoscaler (HPA)?

The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set based on observed resource utilization or custom application metrics.

HPA reads resource usage from the Kubernetes resource metrics API (typically served by Metrics Server) and application-level data from a custom or external metrics API. On each sync period (15 seconds by default), the HPA controller compares the current metric values against the targets defined in its specification and calculates the replica count needed to bring them back to target. This reactive scaling adds replicas as load increases and removes them when demand drops, keeping the service responsive while limiting resource consumption and cost.

For agent deployment observability, HPA is crucial for managing the computational footprint of autonomous agents. By scaling based on metrics like request latency or queue depth, it ensures agentic services remain responsive under variable workloads. Integrating HPA with a comprehensive telemetry pipeline allows for scaling decisions informed by business logic, not just infrastructure metrics. This is a key component of autoscaling strategies within Kubernetes, working alongside the Vertical Pod Autoscaler (VPA) and Cluster Autoscaler to provide a full-stack, elastic infrastructure for dynamic AI workloads.

Key Features of HPA

HPA's key features cover the metrics it can scale on, the algorithm it uses to size a workload, the controls for tuning how fast it scales, and its interaction with node-level autoscaling, so that applications can meet demand and reduce costs during low traffic.

01

Metric-Driven Scaling

HPA scales workloads based on observed metrics, not schedules or predictions. On each sync period (15 seconds by default), it queries the Kubernetes metrics APIs for current values (e.g., average CPU utilization across all pods) and compares them against target values you define.

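  • Default Metric: CPU utilization, expressed as a percentage of each pod's CPU request (not as raw millicores).
  • Custom & External Metrics: Can scale based on application-specific metrics (e.g., requests per second from a Prometheus adapter) or even cloud service metrics. Memory usage is also supported, but as a resource metric alongside CPU rather than as a custom metric.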
  • Target Value: You set a target, such as 70% CPU utilization. HPA's algorithm calculates the desired replica count to meet that target.
02

The Scaling Algorithm

HPA uses a deterministic algorithm to calculate the desired number of replicas.

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

  • The calculation is performed per metric. When multiple metrics are specified, HPA calculates a replica count for each and chooses the highest value.
  • It incorporates stabilization windows to prevent rapid, flapping scale events. The --horizontal-pod-autoscaler-downscale-stabilization flag (default 5 minutes) dictates how long HPA must observe low demand before scaling down.
  • Tolerance: A configurable tolerance band (default 0.1, set with --horizontal-pod-autoscaler-tolerance) prevents trivial rescaling. If the ratio currentMetricValue / desiredMetricValue is within 10% of 1.0, HPA skips the scaling action.
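The formula, the tolerance band, and the "highest value wins" rule for multiple metrics can be sketched in a few lines of Python. This is a simplified illustration, not the controller's actual code: the function name desired_replicas is hypothetical, and stabilization windows, readiness handling, and min/max clamping are deliberately left out.

```python
import math


def desired_replicas(current_replicas, metric_ratios, tolerance=0.1):
    """Propose a replica count from one or more metric ratios.

    metric_ratios holds one currentMetricValue / desiredMetricValue
    ratio per configured metric.
    """
    proposals = []
    for ratio in metric_ratios:
        if abs(ratio - 1.0) <= tolerance:
            # Inside the tolerance band: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # With several metrics, the largest proposal is chosen.
    return max(proposals)


# CPU at 90% against a 70% target with 4 replicas: ceil(4 * 1.2857) = 6
print(desired_replicas(4, [90 / 70]))
# CPU at 73% against a 70% target: ratio 1.04 is within tolerance, stays at 4
print(desired_replicas(4, [73 / 70]))
# Two metrics: CPU proposes 3, requests-per-second proposes 6, so 6 wins
print(desired_replicas(4, [50 / 70, 150 / 100]))
```

Because the largest proposal wins, a single saturated metric is enough to trigger a scale-up even when the others suggest scaling down.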
03

Resource Metrics vs. Custom Metrics

HPA supports distinct metric types that require different API setups.

  • Resource Metrics: Core Kubernetes resource usage (CPU, memory). These use the metrics.k8s.io API, typically provided by the Metrics Server.
  • Custom Metrics: Application-level metrics (e.g., HTTP request queue depth, messages processed per second). These use the custom.metrics.k8s.io API, provided by an adapter like the Prometheus Adapter.
  • External Metrics: Metrics from outside the Kubernetes cluster (e.g., a queue length from AWS SQS or Google Pub/Sub). These use the external.metrics.k8s.io API.

Custom and external metrics enable scaling based on business logic, not just infrastructure saturation.
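To make the three metric types concrete, the sketch below builds the metrics list of an autoscaling/v2 HPA spec as a Python structure and maps each entry to the API group that serves it. The metric names (http_requests_per_second, queue_messages_ready), the label selector, and all target values are illustrative assumptions; the real names depend on what your metrics adapter exposes.

```python
import json

# One entry per metric type. Names and targets are hypothetical examples.
metrics = [
    {   # Resource metric: CPU as a percentage of the pods' CPU requests
        "type": "Resource",
        "resource": {
            "name": "cpu",
            "target": {"type": "Utilization", "averageUtilization": 70},
        },
    },
    {   # Custom (per-pod) metric, served through a metrics adapter
        "type": "Pods",
        "pods": {
            "metric": {"name": "http_requests_per_second"},
            "target": {"type": "AverageValue", "averageValue": "100"},
        },
    },
    {   # External metric, e.g. the depth of a queue outside the cluster
        "type": "External",
        "external": {
            "metric": {
                "name": "queue_messages_ready",
                "selector": {"matchLabels": {"queue": "agent-tasks"}},
            },
            "target": {"type": "AverageValue", "averageValue": "30"},
        },
    },
]

# Which aggregated API each metric type is read from.
API_GROUP_BY_TYPE = {
    "Resource": "metrics.k8s.io",
    "Pods": "custom.metrics.k8s.io",
    "External": "external.metrics.k8s.io",
}

for entry in metrics:
    print(entry["type"], "->", API_GROUP_BY_TYPE[entry["type"]])
print(json.dumps(metrics, indent=2))
```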

04

Behavior Configuration (K8s 1.18+)

The behavior field in the HPA spec allows fine-grained control over scaling speed and stabilization.

  • Scale-Up/Down Policies: Control the rate of scaling. Each policy has a type (Pods for an absolute number of pods, Percent for a percentage of current replicas), a value, and a periodSeconds window over which that change is allowed.
  • Stabilization Windows: Define separate windows for scale-up and scale-down with stabilizationWindowSeconds. A longer downscale stabilization window protects against removing pods too quickly after a brief traffic spike.
  • Select Policies: When several policies are listed, selectPolicy specifies whether the policy allowing the largest change (Max, the default) or the smallest change (Min) is used. Setting it to Disabled turns off scaling in that direction.

Example: A policy allowing a max increase of 100% of current replicas every 60 seconds for rapid scale-up, but a max decrease of 10% every 5 minutes for cautious scale-down.
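The example policy above could be written roughly as follows. This sketch assumes a cluster that serves the autoscaling/v2 API (the behavior field first appeared in autoscaling/v2beta2); the workload name agent-inference, the replica bounds, and the 70% CPU target are hypothetical placeholders.

```python
import json

# Hypothetical HPA manifest: fast scale-up, cautious scale-down.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "agent-inference"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "agent-inference",
        },
        "minReplicas": 2,
        "maxReplicas": 20,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
        "behavior": {
            "scaleUp": {
                # React immediately; at most double the replicas every 60 s.
                "stabilizationWindowSeconds": 0,
                "policies": [
                    {"type": "Percent", "value": 100, "periodSeconds": 60},
                ],
            },
            "scaleDown": {
                # Require 5 minutes of low demand; remove at most 10% per 5 min.
                "stabilizationWindowSeconds": 300,
                "policies": [
                    {"type": "Percent", "value": 10, "periodSeconds": 300},
                ],
            },
        },
    },
}

print(json.dumps(hpa, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output can be saved to a file and applied with kubectl apply -f.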

05

Integration with Cluster Autoscaler

HPA and the Cluster Autoscaler (CA) work together for full-stack elasticity.

  1. HPA adjusts the number of pods (application layer) based on demand.
  2. If scaling up creates pods that cannot be scheduled due to insufficient node resources (CPU/memory), the Cluster Autoscaler detects this.
  3. CA automatically provisions new nodes in the cloud provider (infrastructure layer) to accommodate the pending pods.
  4. Conversely, when HPA scales down pods and nodes become underutilized, CA can safely drain and remove those nodes to reduce infrastructure costs.

This creates a fully automated, cost-efficient scaling pipeline from application to infrastructure.
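The hand-off from HPA to the Cluster Autoscaler can be approximated with simple arithmetic. The sketch below is a deliberately crude model, not how the Cluster Autoscaler actually decides: it considers CPU only, assumes identical nodes and pods, and ignores memory, affinity rules, and system pods. The function name and all numbers are hypothetical.

```python
import math


def pending_pods_and_extra_nodes(desired_pods, pod_cpu_request_m,
                                 node_allocatable_cpu_m, current_nodes):
    """Estimate unschedulable pods and the nodes needed to place them.

    All CPU figures are in millicores.
    """
    pods_per_node = node_allocatable_cpu_m // pod_cpu_request_m
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on a node")
    schedulable = pods_per_node * current_nodes
    # Pods that HPA asked for but the scheduler cannot place yet.
    pending = max(0, desired_pods - schedulable)
    # Nodes the cluster would need to add so that nothing stays pending.
    extra_nodes = math.ceil(pending / pods_per_node)
    return pending, extra_nodes


# HPA wants 20 pods of 500m each; 2 nodes with 3500m allocatable fit 14.
pending, extra = pending_pods_and_extra_nodes(20, 500, 3500, 2)
print(f"{pending} pending pods, {extra} extra node(s) needed")
```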

06

Practical Considerations & Limits

Effective HPA usage requires awareness of its operational boundaries.

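  • Pod Readiness: HPA computes metrics from ready pods. Pods that are not yet ready (e.g., stuck in ContainerCreating or failing readiness probes) are set aside from the calculation, and when a scale-up is being considered they are conservatively treated as consuming 0% of the target. This dampens the scale-up, so a workload with many unready pods may scale more slowly than the raw load suggests.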
  • Resource Requests Must Be Set: For CPU-based scaling, your pod spec must define resources.requests.cpu. HPA cannot calculate utilization percentage without a request value as the denominator.
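  • Cool-Down Delays: Current Kubernetes releases do not apply a fixed cooldown after a scaling action. Scale-down is instead damped by the stabilization window (default 5 minutes, set cluster-wide with --horizontal-pod-autoscaler-downscale-stabilization or per HPA with behavior.scaleDown.stabilizationWindowSeconds), while scale-up has no stabilization window by default. The older --horizontal-pod-autoscaler-upscale-delay and --horizontal-pod-autoscaler-downscale-delay flags have been removed.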
  • Minimum and Maximum Replicas: Always set sensible minReplicas and maxReplicas. maxReplicas caps runaway scale-up before it exhausts cluster resources, and minReplicas (default 1) keeps a floor of capacity during quiet periods.
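The role of resources.requests.cpu as the denominator is easiest to see in code. The sketch below computes utilization as total usage divided by total requests across pods and fails when a request is missing. The pod dictionaries and the function name are hypothetical stand-ins for data the controller would read from the metrics API and the pod specs.

```python
def cpu_utilization_percent(pods):
    """Return CPU utilization as a percentage of requested CPU.

    Each pod is a dict with 'name', 'usage_m' (millicores in use) and
    'request_m' (millicores requested, or None if no request is set).
    """
    total_usage = 0
    total_request = 0
    for pod in pods:
        request = pod.get("request_m")
        if not request:
            # Without a request there is no denominator for a percentage.
            raise ValueError(f"pod {pod['name']} has no CPU request")
        total_usage += pod["usage_m"]
        total_request += request
    return 100 * total_usage // total_request


pods = [
    {"name": "agent-0", "usage_m": 400, "request_m": 500},
    {"name": "agent-1", "usage_m": 450, "request_m": 500},
]
print(cpu_utilization_percent(pods))  # 850m used of 1000m requested -> 85
```

With a 70% target, a reading of 85% would push the replica count up, which is why an unrealistically low CPU request leads to constant scale-ups.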
COMPARISON

HPA vs. Other Scaling Methods

This table compares the Horizontal Pod Autoscaler (HPA) with other primary scaling methods available within Kubernetes and cloud platforms, highlighting key operational characteristics for agent deployment observability.

| Scaling Dimension | Horizontal Pod Autoscaler (HPA) | Vertical Pod Autoscaler (VPA) | Cluster Autoscaler | Manual Scaling |
| --- | --- | --- | --- | --- |
| Primary Scaling Axis | Number of Pod replicas | Pod resource requests/limits (CPU, Memory) | Number of cluster Nodes | Number of Pod replicas or Node resources |
| Automation Level | Fully automatic based on metrics | Fully automatic based on metrics | Fully automatic based on pending pods | Manual via kubectl or dashboard |
| Key Trigger Metrics | CPU utilization, memory utilization, custom/external metrics | Historic CPU/Memory usage recommendations | Pending pods due to insufficient resources | Human observation and decision |
| Typical Scaling Latency | 15-30 seconds (default metrics poll interval) | Requires pod restart; slower (minutes) | Node provisioning time; 1-5 minutes | Immediate upon command execution |
| Impact on Running Pods | Minimal; creates/destroys pods, may cause brief traffic shift | High; often requires pod restart/eviction | High; involves node addition/removal | Minimal for pod count, high for node changes |
| Stateful Application Support | Limited; requires careful design for stateful workloads | Possible, but pod restart is disruptive | Yes, nodes provide resources for StatefulSets | Yes, with manual planning |
| Cost Optimization Focus | Right-sizing replica count for variable load | Right-sizing individual pod resource allocation | Right-sizing node count for cluster demand | None; relies on static over-provisioning |
| Integration with Custom Metrics | Yes, via Custom Metrics API or External Metrics API | No, focuses on core resource metrics | No, reacts to Kubernetes scheduler events | No |
| Use Case in Agent Observability | Dynamic scaling of stateless inference or processing agents based on QPS/latency | Optimizing resource allocation for individual, long-running agent pods | Ensuring cluster capacity for agent deployment during peak loads | Baseline configuration or emergency intervention |

HORIZONTAL POD AUTOSCALER

Frequently Asked Questions

Essential questions about the Kubernetes Horizontal Pod Autoscaler (HPA), a core controller for automating resource scaling in deployments and replica sets based on observed metrics.

The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It operates in a continuous control loop:

  1. The HPA controller queries the Kubernetes Metrics API (for core resource metrics) or a custom metrics API (for application-specific metrics) at a default interval of 15 seconds.
  2. It compares the current metric value (e.g., average CPU utilization across all pods) against the target value defined in the HPA specification.
  3. Using the formula desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)], it calculates the desired number of replicas.
  4. It updates the .spec.replicas field of the target workload object (e.g., Deployment) through its scale subresource, instructing the workload's controllers (e.g., the Deployment and its ReplicaSet) to scale the pods up or down to meet the desired state.

The HPA is a foundational component of autoscaling strategies, enabling applications to handle variable load efficiently and cost-effectively.
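The four steps of the control loop can be simulated with a hypothetical load trace. In this sketch each loop iteration stands for one 15-second sync; the load numbers, the 60% target, and the replica bounds are made up, and tolerance and stabilization windows are omitted, so scale-down happens immediately here, unlike in a real cluster.

```python
import math


def reconcile(current_replicas, current_value, target_value,
              min_replicas, max_replicas):
    """One pass of the loop: apply the formula, then clamp to the bounds."""
    desired = math.ceil(current_replicas * (current_value / target_value))
    return max(min_replicas, min(max_replicas, desired))


replicas = 2
history = []
# Total CPU demand per sync, as a percentage of one pod's request (made up).
for total_load in [120, 300, 900, 900, 200]:
    # Steps 1-2: observe the average utilization across the current pods.
    avg_utilization = total_load / replicas
    # Steps 3-4: compute the desired count and "apply" it to the workload.
    replicas = reconcile(replicas, avg_utilization, target_value=60,
                         min_replicas=2, max_replicas=10)
    history.append(replicas)

print(history)  # [2, 5, 10, 10, 4]
```

The third and fourth iterations would ask for 15 replicas, but maxReplicas holds the workload at 10, which shows why the upper bound matters when load exceeds what the cluster is sized for.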

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.