Glossary

Auto-Scaling

Auto-scaling is a cloud infrastructure capability that automatically adjusts the number of compute instances or pods running a machine learning model service based on real-time demand metrics.

MODEL SERVING ARCHITECTURES

What is Auto-Scaling?

Auto-scaling is a core capability of modern cloud and container orchestration platforms, enabling infrastructure to automatically adapt to fluctuating demand for model inference services.

Auto-scaling is the capability of a cloud or container orchestration platform to automatically adjust the number of compute instances or pods running a model service based on real-time demand metrics such as CPU utilization, memory pressure, or request rate. This dynamic resource management is fundamental to inference cost optimization: infrastructure scales out to meet traffic spikes and scales in during lulls to minimize spending on idle compute. By aligning resource consumption with actual workload, it directly supports infrastructure cost control.

In model serving architectures, auto-scaling is typically implemented with the Horizontal Pod Autoscaler (HPA) in Kubernetes or an equivalent service on a cloud platform, driven by resource metrics or by custom metrics exported from the inference server. Effective configuration requires defining scaling policies, minimum and maximum replica counts, and appropriate cooldown periods to prevent thrashing (rapid alternation between scale-out and scale-in). This capability is essential for maintaining consistent latency and throughput under variable load, and it forms a critical component of a resilient MLOps pipeline for production AI systems.
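
As a rough illustration of how replica bounds and a cooldown period interact, the sketch below gates a proposed replica count. The function name and parameters are hypothetical and do not correspond to any platform's API; real autoscalers implement this logic internally.

```python
def apply_scaling_policy(current, desired, min_replicas, max_replicas,
                         seconds_since_last_scale, cooldown_s):
    """Return the replica count to apply (hypothetical helper, not a real API)."""
    # Keep the proposal inside the configured minimum/maximum replica counts.
    bounded = max(min_replicas, min(max_replicas, desired))
    # Ignore any change requested before the cooldown has elapsed,
    # which is what prevents rapid scale-out/scale-in oscillation.
    if bounded != current and seconds_since_last_scale < cooldown_s:
        return current
    return bounded
```

With a 300-second cooldown, a scale-in proposed 60 seconds after the previous scaling action is held back, while the same proposal after 600 seconds is applied.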

AUTO-SCALING

Key Scaling Metrics for ML Inference

Auto-scaling for ML inference dynamically adjusts compute resources based on real-time demand. Effective scaling requires monitoring specific, actionable metrics that signal when to provision or de-provision instances.

Requests Per Second (RPS)

Requests Per Second (RPS) is the primary driver of scaling decisions, measuring the incoming load on the inference service. Auto-scaling policies typically trigger scale-out events when RPS exceeds a predefined threshold for a sustained period.

Scaling Trigger: A sustained average RPS above the configured target.
Considerations: Sudden, spiky traffic requires aggressive scaling policies, while steady traffic allows for more conservative thresholds. RPS must be correlated with latency to confirm that scaling actually maintains performance.
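
One way to express a "sustained average RPS above target" trigger is a fixed-size sliding window over per-replica RPS samples. This is a minimal sketch; the class name, the window size, and the per-replica target are all assumptions chosen for illustration.

```python
from collections import deque


class SustainedRpsTrigger:
    """Signals scale-out once the windowed average per-replica RPS exceeds a target."""

    def __init__(self, target_rps_per_replica, window_size):
        self.target = target_rps_per_replica
        # Only the most recent `window_size` samples are kept.
        self.samples = deque(maxlen=window_size)

    def observe(self, total_rps, replicas):
        """Record one sample and return True if scale-out should trigger."""
        self.samples.append(total_rps / replicas)
        # Do not trigger until a full window has been observed.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) > self.target
```

A longer window makes the trigger more conservative, which suits steady traffic; a shorter window reacts faster to spikes at the cost of more frequent scaling actions.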

End-to-End Latency (P50, P95, P99)

End-to-End Latency measures the total time from receiving a request to returning a response, including network overhead and queuing. Auto-scaling aims to keep latency percentiles (P50, P95, P99) within service-level objectives (SLOs).

Key Percentiles: The P95 and P99 latencies are critical for scaling, as they indicate tail performance degradation under load.
Scaling Signal: A consistent rise in P95/P99 latency, while RPS is high, indicates insufficient compute capacity and triggers a scale-out. Falling latency with low RPS can trigger scale-in.
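
The latency signal described above can be sketched with a nearest-rank percentile and a simple decision rule. The thresholds and function names are hypothetical, and the scale-in rule is simplified to "P95 within the SLO while RPS is low" rather than tracking a falling trend.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list (p is an integer from 1 to 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_signal(latencies_ms, rps, p95_slo_ms, high_rps, low_rps):
    """Return 'scale_out', 'scale_in', or 'hold' from tail latency and load."""
    p95 = percentile(latencies_ms, 95)
    if p95 > p95_slo_ms and rps >= high_rps:
        return "scale_out"  # tail latency breaches the SLO under high load
    if p95 <= p95_slo_ms and rps <= low_rps:
        return "scale_in"   # latency is healthy and load is low
    return "hold"
```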

GPU/CPU Utilization

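GPU Utilization (for accelerated inference) and CPU Utilization are direct measures of hardware resource consumption. CPU utilization is a built-in resource metric for horizontal pod autoscaling (HPA) in Kubernetes; GPU utilization is not, and has to be supplied to the HPA as a custom or external metric.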

Target Utilization: A common target is 70-80% average utilization across the pod fleet. Sustained averages above this threshold trigger scaling.
GPU-Specific Metrics: For ML inference, GPU Memory Utilization and SM (Streaming Multiprocessor) Activity are more precise indicators of model load than overall GPU utilization alone.

Concurrent Requests / Queue Depth

Concurrent Requests (active requests being processed) and Queue Depth (requests waiting for a worker) are leading indicators of saturation. Growing queues directly cause increased latency.

Scaling Trigger: An increasing average queue depth or number of concurrent requests per pod.
Use with Batching: In systems using continuous batching, the batch size and scheduling efficiency become key scaling considerations alongside simple request count.
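
The link between request rate, latency, and concurrency follows Little's Law: average requests in flight = arrival rate × average time in system. The sketch below uses it to estimate how many replicas keep per-pod concurrency at a chosen level; the function name and the per-replica concurrency limit are assumptions, and batching effects are ignored.

```python
import math


def replicas_for_concurrency(arrival_rps, avg_latency_s, max_concurrent_per_replica):
    """Estimate replicas needed so that requests do not queue (Little's Law)."""
    # Average number of requests in flight across the whole service.
    in_flight = arrival_rps * avg_latency_s
    return max(1, math.ceil(in_flight / max_concurrent_per_replica))
```

For example, 200 RPS at 0.25 s average latency means about 50 requests in flight; with 8 concurrent requests per replica, that calls for 7 replicas.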

Cost Per Inference

Cost Per Inference is the financial metric optimized by auto-scaling, calculated as (total instance cost over a period) / (inferences served in that period). Effective scaling maximizes throughput during demand bursts and minimizes idle resource cost during lulls.

Scale-In Impact: Aggressive scale-in reduces idle cost but risks cold starts when traffic spikes return.
Trade-off: Auto-scaling configuration is a direct trade-off between latency performance (more replicas) and inference cost (fewer replicas).
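
The cost formula and the replica trade-off can be made concrete with a small calculation. The prices and volumes below are made-up illustrative numbers, not measurements.

```python
def cost_per_inference(hourly_cost_per_replica, replicas, inferences_per_hour):
    """Fleet cost for one hour divided by inferences served in that hour."""
    if inferences_per_hour <= 0:
        raise ValueError("inferences_per_hour must be positive")
    return hourly_cost_per_replica * replicas / inferences_per_hour


# Same traffic served by 4 replicas versus 2 replicas (hypothetical figures).
over_provisioned = cost_per_inference(3.0, 4, 120_000)
right_sized = cost_per_inference(3.0, 2, 120_000)
```

Halving the replica count halves the cost per inference only if the smaller fleet can still serve the same traffic within the latency SLO, which is the trade-off described above.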

Pod Startup Time & Cold Start Impact

Pod Startup Time is the duration from scaling decision to a new inference pod being ready to serve. This includes pulling the container image, loading the model into GPU memory, and initializing the server.

Critical for Scaling Policy: A long startup time (e.g., 60+ seconds for large models) requires proactive scaling (predictive scaling) or higher minimum replica counts to avoid latency spikes.
Metric to Monitor: The time to first successful request for a new pod. Scaling policies must account for this lag to be effective.
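
To see why startup lag matters, the sketch below sizes the fleet for the load expected when new pods become ready rather than for the load at decision time. It assumes linear traffic growth over the startup window, which is a simplification; all names and numbers are hypothetical.

```python
import math


def replicas_accounting_for_startup(current_rps, rps_growth_per_s,
                                    startup_s, rps_per_replica):
    """Replicas needed for the load expected once new pods finish starting."""
    # Traffic keeps growing while the new pods pull images and load the model.
    expected_rps = current_rps + rps_growth_per_s * startup_s
    return max(1, math.ceil(expected_rps / rps_per_replica))
```

With 400 RPS now, growth of 2 RPS per second, and a 60-second startup, the service must be sized for 520 RPS: 11 replicas at 50 RPS each, compared with 8 if the lag is ignored.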

POLICY TYPES

Auto-Scaling Policy Comparison

A comparison of common auto-scaling policies used to adjust compute resources for model inference services based on real-time demand.

Policy / Metric	Reactive Scaling	Predictive Scaling	Scheduled Scaling
Primary Trigger	Real-time metric threshold (e.g., CPU > 70%)	Forecasted demand (time-series prediction)	Pre-defined schedule (e.g., 9 AM - 5 PM)
Scaling Speed	Fast (< 1 min to trigger)	Proactive (scales before load arrives)	Deterministic (exact schedule)
Best For Workload	Spiky, unpredictable traffic	Predictable, cyclical patterns (daily/weekly)	Known business hours or events
Configuration Complexity	Low (set min/max, thresholds)	High (requires historical data & model)	Medium (define cron schedules)
Resource Efficiency	Medium (reacts to, doesn't prevent, load)	High (optimizes for predicted load)	Low (may over/under-provision)
Cold Start Impact	High (new replicas start while already under load)	Low (can pre-warm instances)	Medium (can align with pre-warming)
Cost Predictability	Low (varies with traffic)	Medium (follows forecast trend)	High (known schedule)
Common Use Case	Sudden viral API request spike	Retail inference service scaling for peak shopping hours	Internal model API scaled down overnight
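
These policies are often combined. A minimal sketch, assuming a 9 AM to 5 PM schedule as in the table and hypothetical replica counts, uses the schedule as a floor and lets the reactive policy raise the count above it:

```python
from datetime import time

BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def scheduled_floor(now, peak_replicas=6, off_peak_replicas=1):
    """Minimum replicas dictated by the schedule (illustrative values)."""
    in_hours = BUSINESS_START <= now < BUSINESS_END
    return peak_replicas if in_hours else off_peak_replicas


def combined_replicas(reactive_desired, now, max_replicas=20):
    """Take the larger of the reactive proposal and the scheduled floor, capped."""
    return min(max_replicas, max(reactive_desired, scheduled_floor(now)))
```

The scheduled floor gives predictable capacity during known busy hours, while the reactive term still covers unexpected spikes outside them.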

IMPLEMENTATION

Common Auto-Scaling Implementation Platforms

The platforms below are the most common ways to implement auto-scaling for model inference services. Each adjusts compute resources automatically based on real-time demand metrics such as CPU utilization, memory pressure, or request rate.

Kubernetes Horizontal Pod Autoscaler (HPA)

The Kubernetes Horizontal Pod Autoscaler (HPA) is the native, declarative scaling controller for containerized workloads. It automatically adjusts the number of pods in a deployment or replica set based on observed CPU utilization, memory consumption, or custom metrics exposed through the Kubernetes custom metrics API.

Core Mechanism: Continuously monitors specified metrics against target values. Scales the pod count up or down to maintain the desired average metric value.
Custom Metrics: Can scale based on application-specific metrics like queries per second (QPS) or average inference latency, typically via Prometheus Adapter.
Use Case: Ideal for scaling stateless model inference services (e.g., Triton, TorchServe) deployed as Kubernetes deployments. It is the standard for cloud-native, containerized ML serving.
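
The core mechanism above follows the formula documented for the HPA: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The Python below is a simplified sketch of that calculation only; it omits pod readiness handling, missing metrics, and stabilization windows, and the function itself is illustrative rather than part of any Kubernetes client.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas, tolerance=0.1):
    """Simplified version of the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Respect the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 10 pods and a 70% CPU target, an observed average of 85% yields 13 pods, 50% yields 8, and 72% leaves the count unchanged because it falls inside the tolerance band.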

Amazon EC2 Auto Scaling Groups

Amazon EC2 Auto Scaling Groups (ASGs) are a fundamental AWS service for scaling groups of EC2 virtual machine instances. They ensure a specified number of instances are running and can scale based on CloudWatch metrics or schedules.

Core Mechanism: Uses scaling policies tied to CloudWatch alarms (e.g., average CPU > 70%). Can use simple, step, or target tracking policies.
Integration: Often used with Application Load Balancers (ALB) to distribute traffic across the scaled instances. Essential for scaling monolithic inference servers or VM-based serving platforms.
Use Case: Scaling inference workloads deployed on EC2 instances, particularly when not using a container orchestration layer or for stateful services where pod-based scaling is less suitable.
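
To illustrate how a step policy differs from a single threshold, the sketch below maps larger metric breaches to larger capacity additions. It models the idea only and does not call any AWS API; the step bounds and adjustment sizes are hypothetical.

```python
# (lower bound of average CPU %, instances to add), ordered highest first.
STEP_ADJUSTMENTS = [(90.0, 3), (80.0, 2), (70.0, 1)]


def step_scaling_adjustment(avg_cpu_percent, steps=STEP_ADJUSTMENTS):
    """Return how many instances to add for the observed average CPU."""
    for lower_bound, instances_to_add in steps:
        if avg_cpu_percent >= lower_bound:
            return instances_to_add
    return 0  # below the lowest step: no scale-out
```

A target tracking policy, by contrast, needs only the target value (for example 70% average CPU) and leaves the size of each adjustment to the service.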

Google Cloud Managed Instance Groups

Google Cloud Managed Instance Groups (MIGs) are the primary auto-scaling resource for Compute Engine VM instances on GCP. They provide automatic scaling, healing, and updating for groups of identical instances.

Core Mechanism: Scales based on CPU utilization, load balancing capacity, or custom Cloud Monitoring metrics. Supports predictive autoscaling using historical data.
Integration: Seamlessly works with Google Cloud Load Balancing for traffic distribution. Offers stateful MIGs for workloads that require persistent local disks.
Use Case: Auto-scaling VM-based model serving endpoints on GCP, or as the underlying infrastructure for GKE node pools (which themselves host scalable pods).

Azure Virtual Machine Scale Sets

Azure Virtual Machine Scale Sets are an Azure Compute service for deploying and managing a set of identical, load-balanced VMs. The number of VM instances can automatically increase or decrease in response to demand or a defined schedule.

Core Mechanism: Scales based on metrics like average CPU percentage, available via Azure Monitor. Uses scaling rules with cool-down periods to prevent rapid oscillation.
Integration: Integrated with Azure Load Balancer or Application Gateway. Often used to scale the node pools for Azure Kubernetes Service (AKS) clusters.
Use Case: Scaling inference services running directly on Azure VMs or providing the elastic compute backbone for containerized services in AKS.

Knative Serving

Knative Serving is a Kubernetes-based platform for deploying and managing serverless workloads. Its key feature is request-driven autoscaling, scaling pods to zero when idle and rapidly up based on HTTP request concurrency.

Core Mechanism: Uses the Knative Pod Autoscaler (KPA) which scales based on the number of concurrent HTTP requests per pod (default target of 100). Can also use the standard Kubernetes HPA for CPU-based scaling.
Scale-to-Zero: Automatically scales pods to zero after a configurable grace period, eliminating resource costs for sporadically used inference endpoints.
Use Case: Ideal for serverless inference patterns, event-triggered batch scoring, or APIs with highly variable, spiky traffic where cost optimization is critical.
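
Concurrency-based scaling with scale-to-zero can be sketched as below. This is a deliberately simplified model, not the actual KPA algorithm (which also applies averaging windows, a utilization factor, and a panic mode); the function name and the grace period value are assumptions, while the target of 100 comes from the text above.

```python
import math


def concurrency_desired_pods(total_concurrent_requests, target_per_pod=100,
                             idle_seconds=0.0, scale_to_zero_grace_s=30.0):
    """Desired pod count from observed concurrency, allowing scale-to-zero."""
    if total_concurrent_requests <= 0:
        # Keep one pod until the idle period exceeds the grace period.
        return 0 if idle_seconds >= scale_to_zero_grace_s else 1
    return math.ceil(total_concurrent_requests / target_per_pod)
```

Scaling to zero removes idle cost, but the first request after an idle period pays the full pod startup time, so the grace period is a cost versus latency setting.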

Cloud Provider Serverless Functions

Cloud serverless functions (AWS Lambda, Google Cloud Functions, Azure Functions) represent the extreme of auto-scaling, abstracting away servers entirely. Scaling is fully managed and granular to the individual request, and it is close to instantaneous when a warm execution environment is available.

Core Mechanism: The cloud provider's platform automatically provisions and scales the underlying execution environment. Scaling is based purely on the rate of incoming invocation events (HTTP requests, queue messages).
Constraints: Typically have strict limits on runtime duration, memory, and temporary disk space. Cold starts can add significant latency, a critical concern for inference.
Use Case: Lightweight, stateless inference tasks with small models, preprocessing/postprocessing logic, or as orchestrators for larger inference pipelines. Best for workloads tolerant of variable latency.

AUTO-SCALING

Frequently Asked Questions

Auto-scaling is a fundamental capability for cost-effective and resilient model serving. These questions address how it works, its benefits, and key implementation considerations for MLOps and DevOps teams.

Auto-scaling is the automated process of dynamically adjusting the number of compute resources (e.g., virtual machine instances, Kubernetes pods) allocated to a model serving application based on real-time demand metrics. It works by continuously monitoring a predefined set of performance metrics, such as CPU utilization, memory usage, GPU utilization, or request rate, and comparing them against configured thresholds. When a metric stays above or below its threshold for a sustained period, the orchestrator (such as the Kubernetes Horizontal Pod Autoscaler or a cloud provider's service) triggers a scaling action to add or remove replicas of the inference service.

For example, a configuration might specify that the average CPU utilization across all pods should be maintained at 70%. If traffic increases and the average climbs to 85%, the autoscaler will instruct the cluster to deploy additional pods to share the load: the Kubernetes HPA, for instance, computes the desired replica count as ceil(current replicas × current metric / target metric), so 10 pods at 85% become ceil(10 × 85 / 70) = 13. Conversely, if utilization drops to 50%, it will terminate unnecessary pods to reduce costs. This creates an elastic infrastructure that matches capacity to workload.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
