The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically scales the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It operates by querying the Kubernetes Metrics Server for resource usage data, comparing it against target values defined by the user, and instructing the workload's controller to add or remove pods to maintain the desired performance level. This provides a foundational mechanism for elastic scaling to match application demand.

What is Horizontal Pod Autoscaler (HPA)?
The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller for dynamic resource management.
For advanced use cases, HPA can scale on custom metrics and external metrics, which are served through the Kubernetes custom metrics API and external metrics API respectively. This enables scaling driven by application-specific signals like requests per second or queue length. In the context of Large Language Model Operations, HPA is critical for managing the variable inference load of LLM endpoints, ensuring pods scale out during traffic spikes and scale in during lulls to optimize cloud infrastructure costs and maintain service level objectives (SLOs) for latency and availability.
Key Features of HPA
The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set. Its primary function is to ensure application performance and optimize resource utilization by scaling based on observed metrics.
Metric-Driven Scaling
HPA scales workloads based on observed metrics, not schedules. The most common metric is average CPU utilization. It can also scale based on average memory utilization and, critically, on custom and external metrics served through the Kubernetes custom metrics and external metrics APIs. This allows scaling on application-specific KPIs like requests per second, queue length, or any business metric exposed by a custom adapter.
- Core Metrics: CPU, Memory (via the resource metrics API).
- Custom/External Metrics: Application-specific metrics (e.g., http_requests_per_second) via the custom metrics API or external metrics API.
- ContainerResource Metrics: Scale based on the resource usage of a specific container within a pod.
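As a rough illustration of how these metric types appear in an autoscaling/v2 manifest, the fragment below lists one entry of each kind. The container name `app` and the metric name `http_requests_per_second` are placeholders, and the latter only resolves if a metrics adapter actually serves it.

```yaml
# Fragment of spec.metrics in an autoscaling/v2 HorizontalPodAutoscaler.
metrics:
  # Core resource metric: average CPU utilization across all pods,
  # expressed as a percentage of the pods' CPU requests.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Per-container resource metric; "app" is a placeholder container name.
  - type: ContainerResource
    containerResource:
      name: memory
      container: app
      target:
        type: Utilization
        averageUtilization: 80
  # Custom per-pod metric served by an adapter (placeholder metric name).
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```

When several metrics are listed, the controller computes a replica count for each and uses the largest.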
Declarative Configuration
Scaling behavior is defined declaratively through a Kubernetes HorizontalPodAutoscaler resource manifest. Engineers specify the target metric (e.g., 70% CPU utilization), the minimum and maximum number of replicas, and optional stabilization windows. The HPA controller continuously works to reconcile the actual state (current metric values) with this declared desired state.
Key configuration fields include:
- scaleTargetRef: The Deployment, StatefulSet, or other scalable resource to manage.
- metrics: List of target metrics and their desired values.
- minReplicas/maxReplicas: The bounds of the scaling range.
- behavior: Configures scaling speed and stabilization to prevent flapping.
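A minimal manifest using these fields might look like the sketch below. The names `web-hpa` and `web` are hypothetical, and the 70% CPU target and replica bounds are illustrative rather than recommended values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa            # hypothetical HPA name
spec:
  scaleTargetRef:          # the workload whose replica count is managed
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical Deployment name
  minReplicas: 2           # lower bound of the scaling range
  maxReplicas: 10          # upper bound of the scaling range
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU requests
```

The behavior field is omitted here, so the controller falls back to its default scaling behavior.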
Cool-Down & Stabilization
To prevent rapid, unnecessary scaling oscillations (thrashing), HPA smooths its decisions with stabilization windows. By default, scale-up recommendations are applied immediately (a 0-second window), while scale-down uses a 300-second (5-minute) window: the controller keeps evaluating metrics every sync period but acts on the highest replica recommendation seen during that window. The behavior field allows fine-tuning of these stabilization windows and the policies for how many pods can be added or removed within a given period. This keeps scaling decisions stable and cost-effective, avoiding premature scale-in during transient dips in traffic.
Integration with Custom Metrics
For modern, microservices-based applications, scaling on CPU is often insufficient. HPA's power is fully realized through integration with the Custom Metrics API and External Metrics API. This requires a metrics adapter like Prometheus Adapter or Datadog Cluster Agent, which translates application-level metrics from monitoring systems into a format HPA can consume. This enables scaling based on:
- QPS (Queries Per Second) for a web service.
- Message backlog in a Kafka consumer.
- Average latency exceeding a threshold.
- Any business logic metric exposed by the application.
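To give a sense of what the adapter layer involves, here is a sketch of a single Prometheus Adapter rule that derives a per-second request rate from a counter. It assumes the application exports a counter named `http_requests_total` carrying `namespace` and `pod` labels; both the metric and the 2-minute rate window are illustrative assumptions.

```yaml
# Fragment of a Prometheus Adapter configuration file.
rules:
  # Select the series to expose; the counter name is an assumption.
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    # Map Prometheus labels to the Kubernetes resources they describe.
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # Expose http_requests_total as http_requests_per_second.
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    # Query template the adapter fills in for each API request.
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

With a rule like this in place, an HPA could reference `http_requests_per_second` as a Pods-type metric.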
Coordination with Cluster Autoscaler
HPA scales pods within the resources of a node pool. For true elasticity, it must be paired with the Cluster Autoscaler. If HPA requests new pods but there are insufficient node resources (CPU/memory), the Cluster Autoscaler detects the unschedulable pods and provisions new nodes in the cloud. Conversely, if nodes become underutilized after HPA scales down, the Cluster Autoscaler can remove nodes to reduce infrastructure costs. This creates a fully automated, two-layer scaling system: HPA for application pods, Cluster Autoscaler for infrastructure nodes.
HPA vs. Other Scaling Methods
A feature comparison of Kubernetes' Horizontal Pod Autoscaler against alternative scaling approaches for containerized workloads.
| Scaling Feature / Metric | Horizontal Pod Autoscaler (HPA) | Vertical Pod Autoscaler (VPA) | Cluster Autoscaler (CA) | Manual Scaling |
|---|---|---|---|---|
| Primary Scaling Dimension | Number of Pods (Horizontal) | Pod Resource Requests/Limits (Vertical) | Number of Cluster Nodes | Number of Pods or Nodes |
| Core Trigger Mechanism | Observed CPU/Memory or Custom Metrics | Observed Resource Usage vs. Requests | Pending Pods due to Insufficient Resources | Human Operator Decision |
| Typical Scaling Latency | < 30 seconds | 1-2 minutes | 1-5 minutes (node provisioning) | Minutes to Hours |
| Supports Custom Metrics | | | | |
| Requires Application Restart | | | | |
| Handles Traffic Spikes | | | | |
| Optimizes Resource Efficiency | | | | |
| Cost Optimization Focus | Right-sizing replica count | Right-sizing resource requests | Right-sizing node pool | |
| Integration Complexity | Medium (requires metrics server) | High (requires VPA admission controller) | Low (cloud-provider specific) | Low |
HPA in Cloud and LLM Environments
The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically scales the number of pods in a deployment or replica set based on observed CPU utilization or other custom metrics. In LLM serving, it is critical for managing the variable and resource-intensive nature of inference requests.
Core Scaling Mechanism
The HPA operates on a continuous control loop. It queries the Kubernetes Metrics Server or a custom metrics API (like Prometheus Adapter) at a default interval of 15 seconds to collect metrics for targeted resources. It then compares the current metric value (e.g., average CPU utilization across all pods) against the target value defined in the HPA specification and computes the desired number of pods with a simple formula:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
If the ratio of current to target value is within a tolerance of 1.0 (10% by default), no scaling occurs. Otherwise, the HPA updates the replica count of the target deployment or replica set, whose controller then adds or removes pods. This ensures resource consumption aligns dynamically with application load.
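To make the formula concrete, here is a worked example with hypothetical numbers. Suppose 4 replicas are running against a CPU target of 60%, and the observed average is 90%:

$$
\text{desiredReplicas} = \left\lceil 4 \times \frac{90}{60} \right\rceil = \lceil 6 \rceil = 6
$$

The ceiling matters when the product is fractional. With 5 replicas at an observed 70% against the same 60% target:

$$
\text{desiredReplicas} = \left\lceil 5 \times \frac{70}{60} \right\rceil = \lceil 5.83\ldots \rceil = 6
$$

By contrast, an observed 63% gives a ratio of 63/60 = 1.05, which falls inside the controller's default tolerance of 0.1 around 1.0, so the replica count would stay unchanged.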
Custom Metrics for LLM Inference
CPU utilization is often insufficient for scaling LLM inference workloads, which are constrained by GPU memory and throughput. Effective HPA configuration for LLMs relies on custom metrics such as:
- Request Queue Length: The number of inference requests waiting in a queue.
- Average Token Generation Latency: The time taken per output token, which increases under load.
- GPU Memory Utilization: Percentage of GPU VRAM in use.
- Concurrent Requests per Pod: The number of active requests a pod is handling.
These metrics are exposed by inference servers like vLLM or TGI and collected via Prometheus. An HPA can be configured to scale based on a target average queue length (e.g., scale up if queue length > 5) to maintain low latency.
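A queue-based configuration along these lines could be sketched as follows. The Deployment name `llm-server` and the metric name `inference_queue_length` are hypothetical; the real metric name depends on the inference server and on how the metrics adapter is configured to expose it.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server-hpa             # hypothetical HPA name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server               # hypothetical inference Deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
    # Per-pod custom metric; requires a metrics adapter that serves it.
    - type: Pods
      pods:
        metric:
          name: inference_queue_length   # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "5"        # aim for about 5 queued requests per pod
```

Because the target is an average across pods, the controller adds replicas when the mean queue length rises above 5 and removes them when it falls well below.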
Scaling Behaviors & Stabilization
To prevent rapid, flapping scaling actions, HPA provides stabilization window controls:
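- behavior.scaleUp.stabilizationWindowSeconds: How far back the controller looks when scaling up; it applies the lowest replica recommendation from that window (default 0, meaning scale-up happens immediately).
- behavior.scaleDown.stabilizationWindowSeconds: How far back the controller looks when scaling down; it applies the highest replica recommendation from that window (default 300 seconds/5 minutes).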
Policies define how many pods can be added or removed in a single action:
- policies: [{type: Pods, value: 4, periodSeconds: 60}] limits scaling to 4 pods per minute.
For LLMs, a conservative scale-down policy is crucial because pod startup (loading a multi-GB model) can take minutes. Aggressive scale-down can lead to thrashing.
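Put together, a behavior block for a slow-starting LLM workload might look like the sketch below. The 600-second window and the one-pod-per-two-minutes scale-down rate are illustrative assumptions, not tuned recommendations.

```yaml
# Fragment of spec.behavior in an autoscaling/v2 HorizontalPodAutoscaler.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to load increases immediately
    policies:
      - type: Pods
        value: 4                     # add at most 4 pods...
        periodSeconds: 60            # ...per 60-second period
  scaleDown:
    stabilizationWindowSeconds: 600  # consider the last 10 minutes before shrinking
    policies:
      - type: Pods
        value: 1                     # remove at most 1 pod...
        periodSeconds: 120           # ...per 120-second period
```

Keeping scale-up fast and scale-down slow trades some idle capacity for fewer cold starts.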
Integration with Cluster Autoscaler
HPA scales pods within the constraints of available cluster nodes. The Cluster Autoscaler (CA) complements HPA by automatically adjusting the number of nodes in the node pool. The workflow is:
- HPA requests new pods due to high load.
- If there are insufficient resources on existing nodes, the new pods enter a Pending state.
- The Cluster Autoscaler detects pending pods and provisions a new node in the cloud provider.
- Once the node is ready, the pods are scheduled and start.
For GPU-based LLM pods, this requires node pools with the appropriate accelerator type. The CA's provisioning time (often 1-3 minutes for GPU nodes) is a key factor in overall scaling latency.
LLM-Specific Challenges & Patterns
Scaling stateful LLM inference presents unique challenges:
- Cold Start Latency: Loading a 10B+ parameter model into GPU memory can take 30-60 seconds. HPA scaling events directly impact end-user latency.
- Pod Resource Requests: LLM pods must define precise resources.requests for GPU (nvidia.com/gpu), CPU, and memory. HPA can raise the replica count, but pods that cannot be scheduled stay Pending and serve no traffic.
- Multi-Model Endpoints: A single deployment may host multiple model adapters. Scaling metrics must aggregate load across all served models.
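For the resource requests mentioned above, a container spec could be sketched as follows. The container name, image, and sizes are placeholders, and the example assumes the NVIDIA device plugin is installed so that `nvidia.com/gpu` is a schedulable resource.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec).
containers:
  - name: inference-server                  # placeholder container name
    image: example.com/llm-server:latest    # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 32Gi
      limits:
        memory: 32Gi
        # GPUs are declared under limits; the request defaults to the same value.
        nvidia.com/gpu: 1
```

The scheduler uses these values to decide whether a new replica fits on an existing node or must wait for additional capacity.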
A common pattern is to set minReplicas to 2 or more, maintaining a warm pool of ready pods that absorbs traffic spikes while slower scale-out occurs.
Advanced Use: External Metrics & Prometheus
For business-level scaling, HPA can use External Metrics sourced from systems outside Kubernetes. For example, scaling based on:
- Messages in a Kafka topic (e.g., pending inference jobs).
- Custom business logic metrics from an application.
This is typically implemented using the Prometheus Adapter, which translates Prometheus queries into metrics served through the Kubernetes custom and external metrics APIs. A sample HPA spec for an external metric:
```yaml
metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready
      target:
        type: AverageValue
        averageValue: 10
```
This would scale the deployment to maintain an average of 10 ready messages per pod.
Frequently Asked Questions
Essential questions about the Kubernetes Horizontal Pod Autoscaler (HPA), the core controller for automatically scaling application pods based on observed demand.
The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It works by periodically querying the Kubernetes Metrics API (or a custom metrics API) to check the current metric values against the target values defined in the HPA resource. If the observed average utilization is above the target, the HPA increases the replica count (scaling out). If it is below the target, it decreases the replica count (scaling in), down to a defined minimum.
Core Workflow:
- The HPA controller checks metrics every 15 seconds by default (configurable via --horizontal-pod-autoscaler-sync-period).
- It calculates the desired replica count using the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)].
- It updates the .spec.replicas field of the target workload, and the workload's controller (for example, the Deployment controller) takes over to create or terminate pods to match the new desired state.
Related Terms
The Horizontal Pod Autoscaler (HPA) is a core component of Kubernetes' automated scaling ecosystem. Understanding these related concepts is essential for designing resilient, cost-effective deployments.
Vertical Pod Autoscaler (VPA)
A Kubernetes controller that automatically adjusts the CPU and memory resource requests and limits for pods based on historical usage, rather than scaling the number of pod replicas. It is complementary to HPA:
- Right-sizes resources: Prevents over-provisioning and under-provisioning at the container level.
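- Memory-based scaling: Adjusts memory requests based on observed usage, which helps workloads whose memory needs cannot be met by adding replicas.
- Usage: Often used for stateful workloads where horizontal scaling is complex. It can run alongside HPA, provided the two do not act on the same CPU or memory metrics (for example, HPA scales on custom metrics while VPA sizes CPU and memory).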
Cluster Autoscaler
A Kubernetes component that automatically adjusts the size of the node pool in a cluster. It works in tandem with HPA and VPA:
- Node-level scaling: Adds or removes worker nodes from the cluster based on pending pods that cannot be scheduled due to resource constraints.
- Reacts to pod scaling: When HPA creates new pods that require capacity, the Cluster Autoscaler provisions new nodes to host them.
- Cost optimization: Scales down nodes when they are underutilized, reducing cloud infrastructure costs.
Custom Metrics API
A Kubernetes extension API that allows the HPA to scale based on application-specific metrics beyond default CPU and memory. This is critical for scaling modern, metric-rich applications:
- Beyond CPU: Enables scaling based on queries per second (QPS), queue length, application latency, or business logic metrics (e.g., orders per minute).
- Integration: Requires a metrics adapter (like Prometheus Adapter) to translate metrics from monitoring systems (Prometheus, Datadog) into the Kubernetes API.
- Declarative scaling: Pods can be scaled based on the exact metric that indicates true load for the application.
Pod Disruption Budget (PDB)
A Kubernetes policy that limits the number of concurrent voluntary disruptions to a set of pods. It is crucial for maintaining availability during operations that involve HPA and cluster changes:
- Voluntary disruptions: Includes evictions initiated through the cluster, such as node drains during scale-down or node upgrades.
- Protects during scaling: When Cluster Autoscaler removes a node, PDBs ensure a minimum number of application pods remain available.
- Policy types: Defines minAvailable (e.g., "at least 3 pods must run") or maxUnavailable (e.g., "at most 1 pod can be down") for an application.
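As an illustration of the minAvailable form, a PodDisruptionBudget could be declared roughly as follows. The name `web-pdb` and the label `app: web` are hypothetical and must match the labels on the pods being protected.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb          # hypothetical name
spec:
  minAvailable: 3        # at least 3 matching pods must stay available
  selector:
    matchLabels:
      app: web           # hypothetical pod label
```

When pairing a PDB with HPA, it helps to keep minReplicas above minAvailable; if only 3 replicas are running under this budget, no pod can be evicted voluntarily and node drains will wait.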
Service Mesh & Istio Telemetry
A dedicated infrastructure layer that provides advanced traffic management and rich telemetry, which can be used as a source for HPA custom metrics:
- Rich application metrics: Provides out-of-the-box metrics like request rate, error rate, and latency (p95, p99) between services.
- Traffic-based scaling: HPA can be configured to scale deployments based on the request rate or latency metrics emitted by the service mesh (e.g., Istio, Linkerd).
- Canary analysis integration: Enables sophisticated progressive delivery where traffic splitting and autoscaling work in concert for safe rollouts.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.