Glossary

Horizontal Pod Autoscaler (HPA)

A Kubernetes controller that automatically scales the number of pods in a deployment or replica set based on observed CPU utilization or other custom metrics.

KUBERNETES AUTOSCALING

What is Horizontal Pod Autoscaler (HPA)?

The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller for dynamic resource management.

The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically scales the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It operates by querying the resource metrics API (typically served by the Kubernetes Metrics Server) for usage data, comparing it against user-defined target values, and updating the replica count of the target workload, whose own controller then adds or removes pods. This provides a foundational mechanism for elastic scaling to match application demand.

For advanced use cases, HPA can scale on custom metrics and external metrics, served through the Kubernetes custom metrics API and external metrics API respectively. This enables scaling driven by application-specific signals like requests per second or queue length. In the context of Large Language Model Operations (LLMOps), HPA is critical for managing the variable inference load of LLM endpoints, ensuring pods scale out during traffic spikes and scale in during lulls to optimize cloud infrastructure costs and maintain service level objectives (SLOs) for latency and availability.

Key Features of HPA

The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set. Its primary function is to ensure application performance and optimize resource utilization by scaling based on observed metrics.

Metric-Driven Scaling

HPA scales workloads based on observed metrics, not schedules. The most common metric is average CPU utilization. It can also scale on average memory utilization and, critically, on custom and external metrics served through the Kubernetes custom metrics and external metrics APIs. This allows scaling on application-specific KPIs like requests per second, queue length, or any business metric exposed through a metrics adapter.

Core Metrics: CPU, Memory (via resource metrics API).
Custom/External Metrics: Application-specific metrics (e.g., http_requests_per_second) via the custom metrics API and external metrics API.
ContainerResource Metrics: Scale based on the resource usage of a specific container within a pod.
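
As a rough illustration of the three categories above, the fragment below shows how each might appear in the metrics list of an autoscaling/v2 HorizontalPodAutoscaler. The container name app and all target values are assumptions made for this sketch, and http_requests_per_second only resolves if a metrics adapter actually serves it.

```yaml
# Fragment of spec.metrics in an autoscaling/v2 HorizontalPodAutoscaler.
metrics:
# Core resource metric: average memory utilization across all pods,
# expressed as a percentage of the pods' memory requests.
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 75
# ContainerResource metric: CPU usage of one named container only.
- type: ContainerResource
  containerResource:
    name: cpu
    container: app            # hypothetical container name
    target:
      type: Utilization
      averageUtilization: 60
# Custom per-pod metric served through the custom metrics API.
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
```

When several metrics are listed, HPA computes a desired replica count for each one and applies the largest.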

Declarative Configuration

Scaling behavior is defined declaratively through a Kubernetes HorizontalPodAutoscaler resource manifest. Engineers specify the target metric (e.g., 70% CPU utilization), the minimum and maximum number of replicas, and optional stabilization windows. The HPA controller continuously works to reconcile the actual state (current metric values) with this declared desired state.

Key configuration fields include:

scaleTargetRef: The Deployment, StatefulSet, or other scalable resource to manage.
metrics: List of target metrics and their desired values.
minReplicas / maxReplicas: The bounds of the scaling range.
behavior: Configures scaling speed and stabilization to prevent flapping.
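
The fields above fit together roughly as in the following minimal manifest. It is a sketch, not a recommended configuration: the names web-hpa and web, the replica bounds, and the 70% target are assumptions chosen for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa               # hypothetical HPA name
spec:
  scaleTargetRef:             # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # hypothetical Deployment name
  minReplicas: 2              # lower bound of the scaling range
  maxReplicas: 10             # upper bound of the scaling range
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the pods' CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # same as the default, shown explicitly
```

A manifest like this can be applied with kubectl apply -f and inspected with kubectl get hpa web-hpa. Utilization targets are computed against resource requests, so the target pods need CPU requests set for this metric to work.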

Cool-Down & Stabilization

To prevent rapid, unnecessary scaling oscillations (thrashing), HPA applies stabilization windows rather than pausing evaluation. The controller keeps evaluating metrics on every sync, but when scaling down it acts on the highest replica recommendation computed over the stabilization window (default 300 seconds), so a brief dip in load does not remove pods. Scale-up has no stabilization window by default (0 seconds), so HPA reacts to rising load immediately. The behavior field allows fine-tuning of these windows and of the policies that limit how many pods can be added or removed per period. This keeps scaling decisions stable and cost-effective and avoids churn from transient changes in traffic.

Integration with Custom Metrics

For modern, microservices-based applications, scaling on CPU is often insufficient. HPA's power is fully realized through integration with the Custom Metrics API and External Metrics API. This requires a metrics adapter like Prometheus Adapter or Datadog Cluster Agent, which translates application-level metrics from monitoring systems into a format HPA can consume. This enables scaling based on:

QPS (Queries Per Second) for a web service.
Message backlog in a Kafka consumer.
Average latency exceeding a threshold.
Any business logic metric exposed by the application.
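
To show what the adapter layer might look like, here is a sketch of a Prometheus Adapter rule that turns a request counter into a per-second rate for HPA. It follows the rule format documented by the Prometheus Adapter project; the counter name http_requests_total and its namespace and pod labels are assumptions about how the application is instrumented.

```yaml
# Fragment of a Prometheus Adapter configuration file.
rules:
# Discover series for an assumed counter that carries namespace and pod labels.
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}   # map label -> Kubernetes resource
      pod: {resource: "pod"}
  # Expose http_requests_total as http_requests_per_second.
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  # Convert the counter into a rate over a 2-minute window.
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

With a rule like this loaded, an HPA can reference http_requests_per_second as a Pods metric.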

Event-Driven Autoscaling (KEDA)

While HPA is metric-driven, the Kubernetes Event-Driven Autoscaling (KEDA) project extends this paradigm. KEDA acts as an external metrics adapter for HPA, translating events from dozens of sources (Azure Service Bus, AWS SQS, Redis Streams, etc.) into scaling metrics. It can scale from zero to N replicas based on event queue depth, making it ideal for serverless-style workloads, batch processors, and event-driven architectures. KEDA creates and manages the HPA object for each scaled workload: KEDA itself activates and deactivates the workload (scaling between zero and one replica), while the HPA handles scaling between one and N replicas.
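
A KEDA configuration for this kind of workload could look like the sketch below, which scales a queue consumer on AWS SQS depth. The names worker and keda-aws-credentials, the region, and the thresholds are assumptions; the queue URL is a placeholder.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler              # hypothetical name
spec:
  scaleTargetRef:
    name: worker                   # hypothetical Deployment to scale
  minReplicaCount: 0               # allow scale to zero when the queue is empty
  maxReplicaCount: 20
  pollingInterval: 30              # seconds between checks of the event source
  cooldownPeriod: 300              # seconds to wait before scaling back to zero
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: keda-aws-credentials   # hypothetical TriggerAuthentication
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder
      queueLength: "5"             # target messages per replica
      awsRegion: us-east-1
```

KEDA generates the underlying HPA from this object, so a separate HorizontalPodAutoscaler should not be defined for the same workload.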

Coordination with Cluster Autoscaler

HPA scales pods within the resources of a node pool. For true elasticity, it must be paired with the Cluster Autoscaler. If HPA requests new pods but there are insufficient node resources (CPU/memory), the Cluster Autoscaler detects the unschedulable pods and provisions new nodes in the cloud. Conversely, if nodes become underutilized after HPA scales down, the Cluster Autoscaler can remove nodes to reduce infrastructure costs. This creates a fully automated, two-layer scaling system: HPA for application pods, Cluster Autoscaler for infrastructure nodes.

COMPARISON

HPA vs. Other Scaling Methods

A feature comparison of Kubernetes' Horizontal Pod Autoscaler against alternative scaling approaches for containerized workloads.

Scaling Feature / Metric	Horizontal Pod Autoscaler (HPA)	Vertical Pod Autoscaler (VPA)	Cluster Autoscaler (CA)	Manual Scaling
Primary Scaling Dimension	Number of Pods (Horizontal)	Pod Resource Requests/Limits (Vertical)	Number of Cluster Nodes	Number of Pods or Nodes
Core Trigger Mechanism	Observed CPU/Memory or Custom Metrics	Observed Resource Usage vs. Requests	Pending Pods due to Insufficient Resources	Human Operator Decision
Typical Scaling Latency	< 30 seconds	1-2 minutes	1-5 minutes (node provisioning)	Minutes to Hours
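Supports Custom Metrics	Yes	No	No	N/A
Requires Application Restart	No	Yes (pods are recreated to apply new requests)	No	No
Handles Traffic Spikes	Yes	No	Indirectly (adds nodes for pending pods)	No
Optimizes Resource Efficiency	Yes	Yes	Yes	No
Cost Optimization Focus	Right-sizing replica count	Right-sizing resource requests	Right-sizing node pool	N/A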
Integration Complexity	Medium (requires metrics server)	High (requires VPA admission controller)	Low (cloud-provider specific)	Low

KUBERNETES CONTROLLER

HPA in Cloud and LLM Environments

The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically scales the number of pods in a deployment or replica set based on observed CPU utilization or other custom metrics. In LLM serving, it is critical for managing the variable and resource-intensive nature of inference requests.

Core Scaling Mechanism

The HPA operates as a continuous control loop. At a default interval of 15 seconds, it queries the resource metrics API (served by the Kubernetes Metrics Server) or a custom metrics API (served by an adapter such as Prometheus Adapter) for the targeted resources. It then compares the current metric value (e.g., average CPU utilization across all pods) against the target value defined in the HPA specification and computes the desired replica count with the formula:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

The controller then sets this replica count on the target deployment, replica set, or stateful set, whose own controller creates or removes pods accordingly. This keeps resource consumption aligned with application load.
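
To make the formula concrete, here is a worked example with assumed numbers: 4 current replicas, an observed average CPU utilization of 90%, and a target of 60%.

```latex
\text{desiredReplicas} = \left\lceil 4 \times \frac{90}{60} \right\rceil = \lceil 6 \rceil = 6
```

Scaling in works the same way. With 6 replicas at 25% average utilization against the same 60% target:

```latex
\text{desiredReplicas} = \left\lceil 6 \times \frac{25}{60} \right\rceil = \lceil 2.5 \rceil = 3
```

The controller also skips scaling when the ratio currentMetricValue / desiredMetricValue is within a tolerance of 1.0 (0.1 by default), so 63% observed against a 60% target (ratio 1.05) leaves the replica count unchanged. The result is always clamped to the range set by minReplicas and maxReplicas.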

Custom Metrics for LLM Inference

CPU utilization is often insufficient for scaling LLM inference workloads, which are constrained by GPU memory and throughput. Effective HPA configuration for LLMs relies on custom metrics such as:

Request Queue Length: The number of inference requests waiting in a queue.
Average Token Generation Latency: The time taken per output token, which increases under load.
GPU Memory Utilization: Percentage of GPU VRAM in use.
Concurrent Requests per Pod: The number of active requests a pod is handling.

These metrics are exposed by inference servers like vLLM or TGI and collected via Prometheus. An HPA can be configured to scale based on a target average queue length (e.g., scale up if queue length > 5) to maintain low latency.
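
A queue-length target like the one described above might be expressed as follows. This is a sketch under several assumptions: the Deployment is named llm-server, and a metrics adapter exposes a per-pod metric called inference_queue_length (a hypothetical name, since the actual metric name depends on the inference server and the adapter rules).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server                # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_length   # hypothetical per-pod metric from an adapter
      target:
        type: AverageValue
        averageValue: "5"           # scale out when the average queue per pod exceeds 5
```

Scaling on queue depth tends to react before latency degrades, because requests start waiting in the queue as soon as the pods are saturated.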

Scaling Behaviors & Stabilization

To prevent rapid, flapping scaling actions, HPA provides stabilization window controls:

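behavior.scaleUp.stabilizationWindowSeconds: How far back the controller looks at past recommendations before scaling up; it acts on the lowest recommendation in the window (default 0, so scale-up is immediate).
behavior.scaleDown.stabilizationWindowSeconds: How far back the controller looks at past recommendations before scaling down; it acts on the highest recommendation in the window (default 300 seconds/5 minutes).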

Policies define how many pods can be added or removed in a single action:

policies: [{type: Pods, value: 4, periodSeconds: 60}] limits scaling to 4 pods per minute.

For LLMs, a conservative scale-down policy is crucial because pod startup (loading a multi-GB model) can take minutes. Aggressive scale-down removes warm replicas that are slow to replace, which leads to thrashing when load returns.
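
Put together, a behavior block for a slow-starting LLM workload could look like the sketch below. The 600-second window and the one-pod-per-two-minutes scale-down rate are assumptions for illustration, not recommended values.

```yaml
# Fragment of spec.behavior in an autoscaling/v2 HorizontalPodAutoscaler.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to rising load immediately
    policies:
    - type: Pods
      value: 4                       # add at most 4 pods...
      periodSeconds: 60              # ...per 60-second period
  scaleDown:
    stabilizationWindowSeconds: 600  # act on the highest recommendation of the last 10 minutes
    policies:
    - type: Pods
      value: 1                       # remove at most 1 pod...
      periodSeconds: 120             # ...per 120-second period
    selectPolicy: Min                # if several policies apply, use the most conservative
```

The asymmetry is deliberate: scaling out quickly protects latency, while scaling in slowly avoids paying the model-loading cost again shortly after a dip in traffic.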

Integration with Cluster Autoscaler

HPA scales pods within the constraints of available cluster nodes. The Cluster Autoscaler (CA) complements HPA by automatically adjusting the number of nodes in the node pool. The workflow is:

HPA requests new pods due to high load.
If there are insufficient resources on existing nodes, the new pods enter a Pending state.
The Cluster Autoscaler detects pending pods and provisions a new node in the cloud provider.
Once the node is ready, the pods are scheduled and start.

For GPU-based LLM pods, this requires node pools with the appropriate accelerator type. The CA's provisioning time (often 1-3 minutes for GPU nodes) is a key factor in overall scaling latency.

LLM-Specific Challenges & Patterns

Scaling stateful LLM inference presents unique challenges:

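Cold Start Latency: Loading a 10B+ parameter model into GPU memory can take 30-60 seconds or longer, on top of any node provisioning and image pull time. New replicas serve no traffic until the model is loaded, so scale-out does not relieve end-user latency immediately.
Pod Resource Requests: LLM pods must declare GPU (nvidia.com/gpu, set under resources.limits), CPU, and memory resources precisely. HPA can raise the replica count, but pods that cannot be scheduled stay Pending and add no capacity.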
Multi-Model Endpoints: A single deployment may host multiple model adapters. Scaling metrics must aggregate load across all served models.
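
For the resource declarations mentioned above, a container spec might look like this sketch. The container name, image, and sizes are placeholders, and the nvidia.com/gpu resource is only available on clusters where the NVIDIA device plugin is installed.

```yaml
# Fragment of a Deployment pod template (spec.template.spec).
containers:
- name: inference-server                          # hypothetical container name
  image: registry.example.com/llm-server:latest   # placeholder image
  resources:
    requests:
      cpu: "4"
      memory: 32Gi
    limits:
      memory: 32Gi
      nvidia.com/gpu: 1    # GPUs are set under limits; the request defaults to the same value
```

Because GPUs are allocated as whole devices, each new replica needs a node with a free GPU, which is what usually triggers node provisioning during scale-out.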

A common pattern is to set the HPA minimum replica count to 2 or more to maintain a warm pool of ready pods, absorbing traffic spikes while slower scale-out occurs.

Advanced Use: External Metrics & Prometheus

For business-level scaling, HPA can use External Metrics sourced from systems outside Kubernetes. For example, scaling based on:

Messages in a Kafka topic (e.g., pending inference jobs).
Custom business logic metrics from an application.

This is typically implemented using the Prometheus Adapter, which translates Prometheus queries into metrics served through the Kubernetes custom and external metrics APIs. A sample HPA spec for an external metric:

```yaml
metrics:
- type: External
  external:
    metric:
      name: queue_messages_ready
    target:
      type: AverageValue
      averageValue: 10
```

This would scale the deployment to maintain an average of 10 ready messages per pod.
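
As a worked example, assume the external metric reports 45 ready messages in total. With an AverageValue target, the total is divided across the pods, so the desired replica count reduces to the total divided by the per-pod target, rounded up:

```latex
\text{desiredReplicas} = \left\lceil \frac{45}{10} \right\rceil = \lceil 4.5 \rceil = 5
```

If the backlog later drops to 12 messages, the same calculation gives 2 replicas, subject to the scale-down stabilization window and the minReplicas bound.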

HORIZONTAL POD AUTOSCALER

Frequently Asked Questions

Essential questions about the Kubernetes Horizontal Pod Autoscaler (HPA), the core controller for automatically scaling application pods based on observed demand.

The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It works by periodically querying the Kubernetes Metrics API (or a custom metrics API) to check the current metric values against the target values defined in the HPA resource. If the observed average utilization is above the target, the HPA increases the replica count (scaling out). If it is below the target, it decreases the replica count (scaling in), down to a defined minimum.

Core Workflow:

The HPA controller checks metrics every 15 seconds by default (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period).
It calculates the desired replica count using the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)].
It updates the replica count of the target workload through its scale subresource, and the workload's controller (for example, the Deployment controller) creates or terminates pods to match the new desired state.

TRAFFIC AND DEPLOYMENT STRATEGIES

Related Terms

The Horizontal Pod Autoscaler (HPA) is a core component of Kubernetes' automated scaling ecosystem. Understanding these related concepts is essential for designing resilient, cost-effective deployments.

Vertical Pod Autoscaler (VPA)

A Kubernetes controller that automatically adjusts the CPU and memory resource requests and limits for pods based on historical usage, rather than scaling the number of pod replicas. It is complementary to HPA:

Right-sizes resources: Prevents over-provisioning and under-provisioning at the container level.
Memory right-sizing: Adjusts memory requests from observed usage, which helps workloads whose memory footprint does not drop when replicas are added.
Usage: Often used for stateful workloads where horizontal scaling is complex, or alongside HPA for comprehensive resource optimization, provided the two do not act on the same CPU or memory metric.
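
For comparison with an HPA manifest, a VPA object might be declared as in the sketch below. VPA is a separate add-on with its own custom resource, so this assumes it is installed in the cluster; the names db-vpa and db and the resource bounds are illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: db-vpa                 # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: db                   # hypothetical workload
  updatePolicy:
    updateMode: "Off"          # only publish recommendations; do not evict pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"       # apply to every container in the pod
      minAllowed:
        cpu: 250m
        memory: 512Mi
      maxAllowed:
        cpu: "4"
        memory: 8Gi
```

Starting in recommendation-only mode is a cautious choice: the suggested requests can be reviewed before VPA is allowed to recreate pods.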

Cluster Autoscaler

A Kubernetes component that automatically adjusts the size of the node pool in a cluster. It works in tandem with HPA and VPA:

Node-level scaling: Adds or removes worker nodes from the cluster based on pending pods that cannot be scheduled due to resource constraints.
Reacts to pod scaling: When HPA creates new pods that require capacity, the Cluster Autoscaler provisions new nodes to host them.
Cost optimization: Scales down nodes when they are underutilized, reducing cloud infrastructure costs.

Custom Metrics API

A Kubernetes extension API that allows the HPA to scale based on application-specific metrics beyond default CPU and memory. This is critical for scaling modern, metric-rich applications:

Beyond CPU: Enables scaling based on queries per second (QPS), queue length, application latency, or business logic metrics (e.g., orders per minute).
Integration: Requires a metrics adapter (like Prometheus Adapter) to translate metrics from monitoring systems (Prometheus, Datadog) into the Kubernetes API.
Declarative scaling: Pods can be scaled based on the exact metric that indicates true load for the application.

Pod Disruption Budget (PDB)

A Kubernetes policy that limits the number of concurrent voluntary disruptions to a set of pods. It is crucial for maintaining availability during operations that involve HPA and cluster changes:

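Voluntary disruptions: Includes evictions initiated by an operator or a cluster component, such as draining a node for scale-down or maintenance. Rolling updates and HPA scale-in delete pods directly rather than through the eviction API, so PDBs do not limit them.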
Protects during scaling: When Cluster Autoscaler removes a node, PDBs ensure a minimum number of application pods remain available.
Policy types: Defines minAvailable (e.g., "at least 3 pods must run") or maxUnavailable (e.g., "at most 1 pod can be down") for an application.
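
The minAvailable example above could be written as the following manifest. The name and the app: llm-server label are assumptions; the selector must match the labels on the pods that the HPA-managed workload creates.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-server-pdb         # hypothetical name
spec:
  minAvailable: 3              # at least 3 matching pods must stay running during evictions
  selector:
    matchLabels:
      app: llm-server          # hypothetical pod label
```

It is worth keeping minAvailable below the HPA's minReplicas. If the budget is equal to or greater than the number of running replicas, no eviction is allowed and node drains can stall.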

Kubernetes Event-driven Autoscaling (KEDA)

A single-purpose, lightweight component that extends HPA to scale workloads based on events from external systems. It bridges Kubernetes with event sources:

Event-driven scaling: Scales from zero to N based on queue depth (Azure Service Bus, Apache Kafka), database metrics, or custom HTTP events.
HPA as core: KEDA acts as a metrics adapter for HPA, managing the lifecycle of the scaler and feeding custom metrics.
Use case: Ideal for serverless-style workloads, batch jobs, and microservices triggered by event queues, where scaling based on CPU is ineffective.

Service Mesh & Istio Telemetry

A dedicated infrastructure layer that provides advanced traffic management and rich telemetry, which can be used as a source for HPA custom metrics:

Rich application metrics: Provides out-of-the-box metrics like request rate, error rate, and latency (p95, p99) between services.
Traffic-based scaling: HPA can be configured to scale deployments based on the request rate or latency metrics emitted by the service mesh (e.g., Istio, Linkerd).
Canary analysis integration: Enables sophisticated progressive delivery where traffic splitting and autoscaling work in concert for safe rollouts.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than 5 years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.