Inferensys

Glossary

Horizontal Pod Autoscaler (HPA)

A Kubernetes controller that automatically scales the number of pods in a deployment or replica set based on observed CPU utilization or other custom metrics.
KUBERNETES AUTOSCALING

What is Horizontal Pod Autoscaler (HPA)?

The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller for dynamic resource management.

The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically scales the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It operates by querying the Kubernetes Metrics Server for resource usage data, comparing it against target values defined by the user, and instructing the workload's controller to add or remove pods to maintain the desired performance level. This provides a foundational mechanism for elastic scaling to match application demand.

For advanced use cases, HPA can scale on custom metrics and external metrics, served through the Kubernetes custom metrics API and external metrics API respectively. This enables scaling driven by application-specific signals like requests per second or queue length. In the context of Large Language Model Operations (LLMOps), HPA is critical for managing the variable inference load of LLM endpoints: pods scale out during traffic spikes and scale in during lulls, which optimizes cloud infrastructure costs while maintaining service level objectives (SLOs) for latency and availability.

KUBERNETES AUTOSCALING

Key Features of HPA

The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically adjusts the number of pod replicas in a deployment or replica set. Its primary function is to ensure application performance and optimize resource utilization by scaling based on observed metrics.

01

Metric-Driven Scaling

HPA scales workloads based on observed metrics, not schedules. The most common metric is average CPU utilization. It can also scale on average memory utilization and, critically, on custom and external metrics served through the custom metrics API and external metrics API. This allows scaling on application-specific KPIs like requests per second, queue length, or any business metric exposed by a metrics adapter.

  • Core Metrics: CPU, Memory (via resource metrics API).
  • Custom/External Metrics: Application-specific metrics (e.g., http_requests_per_second) via the custom and external metrics APIs.
  • ContainerResource Metrics: Scale based on the resource usage of a specific container within a pod.
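
As a rough illustration of how these metric categories appear in an autoscaling/v2 manifest, the fragment below lists one entry of each kind under spec.metrics. The container name inference-server, the metric name http_requests_per_second, and all target values are assumptions chosen for illustration, and the Pods entry only works if a custom metrics adapter is actually serving that metric.

```yaml
# Fragment of spec.metrics in an autoscaling/v2 HorizontalPodAutoscaler.
metrics:
# Core resource metric: average CPU utilization across all pods.
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
# ContainerResource metric: CPU of one named container only,
# ignoring sidecars in the same pod.
- type: ContainerResource
  containerResource:
    name: cpu
    container: inference-server      # hypothetical container name
    target:
      type: Utilization
      averageUtilization: 70
# Custom per-pod metric served through the custom metrics API.
- type: Pods
  pods:
    metric:
      name: http_requests_per_second # hypothetical metric name
    target:
      type: AverageValue
      averageValue: "100"
```

When several metrics are listed, the controller computes a desired replica count for each one and uses the largest result.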

02

Declarative Configuration

Scaling behavior is defined declaratively through a Kubernetes HorizontalPodAutoscaler resource manifest. Engineers specify the target metric (e.g., 70% CPU utilization), the minimum and maximum number of replicas, and optional stabilization windows. The HPA controller continuously works to reconcile the actual state (current metric values) with this declared desired state.

Key configuration fields include:

  • scaleTargetRef: The Deployment, StatefulSet, or other scalable resource to manage.
  • metrics: List of target metrics and their desired values.
  • minReplicas / maxReplicas: The bounds of the scaling range.
  • behavior: Configures scaling speed and stabilization to prevent flapping.
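
A minimal sketch of such a manifest, showing where each of the fields above sits, might look like the following. The names web-hpa and web, the replica bounds, and the 70% target are illustrative assumptions rather than recommended values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                  # hypothetical HPA name
spec:
  scaleTargetRef:                # the workload this HPA manages
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment name
  minReplicas: 2                 # lower bound of the scaling range
  maxReplicas: 10                # upper bound of the scaling range
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the pods' CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # same as the default, shown for clarity
```

Utilization targets are calculated relative to each container's resource requests, so the target pods need CPU requests set for this metric to be usable.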

03

Cool-Down & Stabilization

To prevent rapid, unnecessary scaling oscillations (thrashing), HPA applies stabilization windows rather than acting on every metric sample. By default, scale-down uses a 300-second (5-minute) window, during which the controller acts on the highest replica recommendation it has computed, while scale-up has no stabilization window (0 seconds) and takes effect immediately. The behavior field allows fine-tuning of these stabilization windows and of the policies for how many pods can be added or removed per period. This keeps scaling decisions stable and cost-effective and avoids removing pods in response to a brief dip in load.

04

Integration with Custom Metrics

For modern, microservices-based applications, scaling on CPU is often insufficient. HPA's power is fully realized through integration with the Custom Metrics API and External Metrics API. This requires a metrics adapter like Prometheus Adapter or Datadog Cluster Agent, which translates application-level metrics from monitoring systems into a format HPA can consume. This enables scaling based on:

  • QPS (Queries Per Second) for a web service.
  • Message backlog in a Kafka consumer.
  • Average latency exceeding a threshold.
  • Any business logic metric exposed by the application.
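
To make the adapter's role more concrete, here is a sketch of a Prometheus Adapter rule that could expose a per-pod request rate to HPA. It assumes the application exports a Prometheus counter named http_requests_total carrying namespace and pod labels; the counter name, the 2-minute rate window, and the resulting metric name are assumptions for illustration.

```yaml
# Sketch of a rule in the Prometheus Adapter configuration file.
rules:
# Discover series for the (assumed) counter, limited to those with pod labels.
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  # Map Prometheus labels onto Kubernetes resources.
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  # Rename http_requests_total to http_requests_per_second.
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  # Convert the counter into a per-second rate, grouped per pod.
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

With a rule along these lines in place, an HPA metric of type Pods could reference http_requests_per_second by name.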

06

Coordination with Cluster Autoscaler

HPA scales pods within the resources of a node pool. For true elasticity, it must be paired with the Cluster Autoscaler. If HPA requests new pods but there are insufficient node resources (CPU/memory), the Cluster Autoscaler detects the unschedulable pods and provisions new nodes in the cloud. Conversely, if nodes become underutilized after HPA scales down, the Cluster Autoscaler can remove nodes to reduce infrastructure costs. This creates a fully automated, two-layer scaling system: HPA for application pods, Cluster Autoscaler for infrastructure nodes.

COMPARISON

HPA vs. Other Scaling Methods

A feature comparison of Kubernetes' Horizontal Pod Autoscaler against alternative scaling approaches for containerized workloads.

| Scaling Feature / Metric | Horizontal Pod Autoscaler (HPA) | Vertical Pod Autoscaler (VPA) | Cluster Autoscaler (CA) | Manual Scaling |
| --- | --- | --- | --- | --- |
| Primary Scaling Dimension | Number of Pods (Horizontal) | Pod Resource Requests/Limits (Vertical) | Number of Cluster Nodes | Number of Pods or Nodes |
| Core Trigger Mechanism | Observed CPU/Memory or Custom Metrics | Observed Resource Usage vs. Requests | Pending Pods due to Insufficient Resources | Human Operator Decision |
| Typical Scaling Latency | < 30 seconds | 1-2 minutes | 1-5 minutes (node provisioning) | Minutes to Hours |
| Supports Custom Metrics | | | | |
| Requires Application Restart | | | | |
| Handles Traffic Spikes | | | | |
| Optimizes Resource Efficiency | | | | |
| Cost Optimization Focus | Right-sizing replica count | Right-sizing resource requests | Right-sizing node pool | |
| Integration Complexity | Medium (requires metrics server) | High (requires VPA admission controller) | Low (cloud-provider specific) | Low |

KUBERNETES CONTROLLER

HPA in Cloud and LLM Environments

The Horizontal Pod Autoscaler (HPA) is a core Kubernetes controller that automatically scales the number of pods in a deployment or replica set based on observed CPU utilization or other custom metrics. In LLM serving, it is critical for managing the variable and resource-intensive nature of inference requests.

01

Core Scaling Mechanism

The HPA operates on a continuous control loop. At a default interval of 15 seconds, it queries the resource metrics API (served by the Kubernetes Metrics Server) or a custom metrics API (served by an adapter such as Prometheus Adapter) to collect metrics for the targeted resources. It then compares the current metric value (e.g., average CPU utilization across all pods) against the target value defined in the HPA specification and computes the desired number of pods with a simple formula:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

The controller then updates the replica count on the target workload, and the deployment or replica set controller adds or removes pods accordingly. This ensures resource consumption aligns dynamically with application load.
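
A short worked example may help here. The numbers are made up for illustration: 4 replicas, a 60% CPU target, and two different observed averages. The second case also reflects the controller's default tolerance of 0.1, under which no scaling happens when the ratio of current to desired metric value is within 10% of 1.

```latex
% Case 1: observed average CPU is 90% against a 60% target.
\[
\text{desiredReplicas}
  = \left\lceil 4 \times \frac{90}{60} \right\rceil
  = \lceil 6.0 \rceil
  = 6
\]

% Case 2: observed average CPU is 63% against a 60% target.
% The ratio is within the default tolerance, so the replica count stays at 4.
\[
\frac{63}{60} = 1.05,
\qquad
\lvert 1.05 - 1 \rvert = 0.05 \le 0.1
\;\Rightarrow\;
\text{desiredReplicas} = 4
\]
```

In the first case the HPA scales from 4 to 6 pods. In the second, the formula alone would give ceil(4.2) = 5, but the tolerance check suppresses the change.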

02

Custom Metrics for LLM Inference

CPU utilization is often insufficient for scaling LLM inference workloads, which are constrained by GPU memory and throughput. Effective HPA configuration for LLMs relies on custom metrics such as:

  • Request Queue Length: The number of inference requests waiting in a queue.
  • Average Token Generation Latency: The time taken per output token, which increases under load.
  • GPU Memory Utilization: Percentage of GPU VRAM in use.
  • Concurrent Requests per Pod: The number of active requests a pod is handling.

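These metrics are exposed by inference servers like vLLM or TGI and collected via Prometheus. Because HPA works toward a target value rather than a trigger threshold, it can be configured with a target average queue length per pod (e.g., 5): replicas are added when the observed average rises above that target and removed when it falls below, which helps maintain low latency.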
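
A sketch of an HPA that targets queue length per pod could look like this. The metric name inference_queue_length is hypothetical: it stands in for whatever name a metrics adapter publishes for the inference server's queue gauge. The Deployment name and replica bounds are likewise assumptions.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa            # hypothetical HPA name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference              # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_length # hypothetical metric served by an adapter
      target:
        type: AverageValue
        averageValue: "5"            # target of 5 queued requests per pod
```

With this target, 3 pods averaging 9 queued requests each would yield ceil(3 * 9 / 5) = 6 desired replicas, subject to maxReplicas and any configured scaling policies.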

03

Scaling Behaviors & Stabilization

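To prevent rapid, flapping scaling actions, HPA provides stabilization windows, configured under the behavior field:

  • behavior.scaleUp.stabilizationWindowSeconds: How far back the controller looks at its earlier recommendations before scaling up; it acts on the lowest recommendation in that window (default 0, meaning it scales up immediately).
  • behavior.scaleDown.stabilizationWindowSeconds: How far back the controller looks at its earlier recommendations before scaling down; it acts on the highest recommendation in that window (default 300 seconds/5 minutes).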

Policies define how many pods can be added or removed in a single action:

  • policies: [{type: Pods, value: 4, periodSeconds: 60}] limits scaling to 4 pods per minute.

For LLMs, a conservative scale-down policy is crucial because pod startup (loading a multi-GB model) can take minutes, so capacity that is removed too eagerly is slow to restore. Aggressive scale-down can lead to thrashing.
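
Put together, a behavior section along these lines is one plausible starting point for a slow-starting workload. The specific numbers (a 10-minute scale-down window, 10% removal per minute) are illustrative assumptions, not tuned recommendations.

```yaml
# Fragment of spec.behavior in an autoscaling/v2 HorizontalPodAutoscaler.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to load increases immediately
    policies:
    - type: Pods
      value: 4                       # add at most 4 pods...
      periodSeconds: 60              # ...per 60-second period
  scaleDown:
    stabilizationWindowSeconds: 600  # use the highest recommendation from the last 10 minutes
    policies:
    - type: Percent
      value: 10                      # remove at most 10% of current replicas...
      periodSeconds: 60              # ...per 60-second period
```

The asymmetry is deliberate: scaling up quickly protects latency, while scaling down slowly avoids paying the model-loading cost again if traffic returns.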

04

Integration with Cluster Autoscaler

HPA scales pods within the constraints of available cluster nodes. The Cluster Autoscaler (CA) complements HPA by automatically adjusting the number of nodes in the node pool. The workflow is:

  1. HPA requests new pods due to high load.
  2. If there are insufficient resources on existing nodes, the new pods enter a Pending state.
  3. The Cluster Autoscaler detects pending pods and provisions a new node in the cloud provider.
  4. Once the node is ready, the pods are scheduled and start.

For GPU-based LLM pods, this requires node pools with the appropriate accelerator type. The CA's provisioning time (often 1-3 minutes for GPU nodes) is a key factor in overall scaling latency.

05

LLM-Specific Challenges & Patterns

Scaling stateful LLM inference presents unique challenges:

  • Cold Start Latency: Loading a 10B+ parameter model into GPU memory can take 30-60 seconds. HPA scaling events directly impact end-user latency.
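  • Pod Resource Requests: LLM pods must define precise resource requests and limits for GPU (nvidia.com/gpu, which is set under resources.limits), CPU, and memory. HPA can raise the replica count, but pods that cannot be scheduled stay Pending and add no serving capacity.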
  • Multi-Model Endpoints: A single deployment may host multiple model adapters. Scaling metrics must aggregate load across all served models.
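
For the resource requests point above, a Deployment's pod template might declare resources roughly as follows. The container name, image, and sizes are placeholders; appropriate values depend on the model and the accelerator type.

```yaml
# Fragment of a Deployment manifest (spec.template.spec).
spec:
  template:
    spec:
      containers:
      - name: inference-server                 # hypothetical container name
        image: example.com/llm-server:latest   # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 32Gi
          limits:
            memory: 32Gi
            nvidia.com/gpu: 1   # GPUs are declared under limits; the request defaults to the same value
```

Each replica created by HPA then needs a node with one free GPU, which is why scale-out often depends on the Cluster Autoscaler adding GPU nodes.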

A common pattern is to set minReplicas above the default of 1 (e.g., 2) to maintain a warm pool of ready pods, absorbing traffic spikes while slower scale-out occurs.

06

Advanced Use: External Metrics & Prometheus

For business-level scaling, HPA can use External Metrics sourced from systems outside Kubernetes. For example, scaling based on:

  • Messages in a Kafka topic (e.g., pending inference jobs).
  • Custom business logic metrics from an application.

This is typically implemented using a metrics adapter such as the Prometheus Adapter, which translates Prometheus queries into metrics served through the Kubernetes external metrics API. A sample HPA spec fragment for an external metric:

```yaml
metrics:
- type: External
  external:
    metric:
      name: queue_messages_ready
    target:
      type: AverageValue
      averageValue: 10
```

This would scale the deployment to maintain an average of 10 ready messages per pod.

HORIZONTAL POD AUTOSCALER

Frequently Asked Questions

Essential questions about the Kubernetes Horizontal Pod Autoscaler (HPA), the core controller for automatically scaling application pods based on observed demand.

The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed utilization of specified metrics, such as CPU or memory. It works by periodically querying the Kubernetes Metrics API (or a custom metrics API) to check the current metric values against the target values defined in the HPA resource. If the observed average utilization is above the target, the HPA increases the replica count (scaling out). If it is below the target, it decreases the replica count (scaling in), down to a defined minimum.

Core Workflow:

  1. The HPA controller checks metrics every 15 seconds by default (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period).
  2. It calculates the desired replica count using the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)].
  3. It updates the .spec.replicas field of the target workload, and the Deployment controller takes over to create or terminate pods to match the new desired state.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.