Inferensys

Continuous Batching

Continuous batching is an advanced inference optimization technique for autoregressive models where new requests are dynamically added to a running batch as previous requests finish generation, maximizing GPU utilization and throughput.
INFERENCE OPTIMIZATION

What is Continuous Batching?

Continuous batching is an advanced inference optimization technique for autoregressive models like large language models (LLMs) that dramatically increases GPU utilization and throughput.

Continuous batching, also known as iterative batching or in-flight batching, is a server-side optimization where new inference requests are dynamically added to a running batch as previous requests finish generating tokens. Unlike static batching, which waits for an entire batch to complete before processing new requests, this method schedules each sequence in the batch independently. At every step, the server runs the next token generation only for the sequences that are still active, so finished requests exit the batch and new ones join right away. This removes most of the GPU idle time that static batching spends waiting on the longest sequence.

This technique is fundamental to high-performance inference servers like vLLM and Text Generation Inference (TGI). It works in tandem with the Key-Value (KV) Cache, where memory for finished sequences is efficiently reclaimed. The primary benefit is a significant increase in throughput (tokens/second) and hardware utilization, especially for workloads with variable sequence lengths and arrival times, making it a cornerstone of cost-effective LLM serving in production environments.
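
To make the link between the KV cache and batch size concrete, here is a rough sizing sketch. The function names and the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) are illustrative assumptions, not figures from any particular deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: one key and one value
    vector per layer and per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(cache_budget_bytes, bytes_per_token):
    """Total tokens, summed over all active sequences, that fit in the budget."""
    return cache_budget_bytes // bytes_per_token


# Assumed, illustrative model shape and a 16 GiB cache budget.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
capacity = max_cached_tokens(16 * 1024**3, per_token)
```

With these assumed values each token takes 0.5 MiB, so the budget holds 32,768 tokens across all running sequences. Every token freed by a finished request is capacity a newly admitted request can use.
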

CORE MECHANICS

Key Features and Benefits

Continuous batching fundamentally rethinks request processing for autoregressive models. Instead of waiting for a full batch to complete, it dynamically manages a pool of requests, leading to significant performance gains.

01

Iterative Request Scheduling

Unlike static batching, which waits for all sequences in a batch to finish generation, continuous batching adds new requests to a running batch as previous ones complete. This is often called iteration-level scheduling or incremental batching. The scheduler manages a pool of active requests, and at each model forward pass, it only computes tokens for requests that are still generating, eliminating idle GPU cycles.

  • Dynamic Pool: New queries join the active pool immediately.
  • Finished Requests: Completed sequences are removed, and their GPU memory is freed for new ones.
  • Higher Utilization: GPUs are kept consistently busy, dramatically improving throughput (tokens/second).
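
The scheduling loop described above can be sketched as a small simulation. This is a minimal illustration under simplifying assumptions (each forward pass yields exactly one token per active request, and prefill cost is ignored); `Request` and `run_continuous` are hypothetical names, not part of any serving engine.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request that needs `tokens_left` more decoding steps."""
    name: str
    tokens_left: int


def run_continuous(requests, max_batch_size):
    """Simulate iteration-level scheduling; return {name: step it finished}.

    Note: this mutates the Request objects it is given.
    """
    waiting = deque(requests)   # requests not yet admitted
    running = []                # the active pool
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this forward pass.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        # One "forward pass": every active request emits one token.
        for req in running:
            req.tokens_left -= 1
            if req.tokens_left <= 0:
                finished_at[req.name] = step
        # Retire finished requests so their slots free up for the next step.
        running = [r for r in running if r.tokens_left > 0]
    return finished_at


result = run_continuous(
    [Request("A", 2), Request("B", 5), Request("C", 3)], max_batch_size=2
)
```

In this run, request C takes over A's slot at step 3 instead of waiting for B to finish, so both B and C complete at step 5.
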
02

PagedAttention & Memory Optimization

A major bottleneck in LLM inference is managing the Key-Value (KV) Cache. Continuous batching becomes far more effective when paired with block-based memory management such as PagedAttention (used in vLLM). This technique treats the KV cache like virtual memory:

  • Non-Contiguous Blocks: KV cache is stored in fixed-size blocks, not per-sequence contiguous memory.
  • Reduces Fragmentation: Because every block has the same size, there is no external fragmentation, and internal waste is limited to the last, partially filled block of each sequence rather than padding to a maximum length.
  • Efficient Sharing: Allows for memory sharing between similar prompts in advanced scenarios.

Together, these properties let the system efficiently allocate and deallocate memory for sequences as they start and finish, which is critical for maintaining high batch sizes with variable-length outputs.
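
The block-based bookkeeping can be illustrated with a toy allocator. This is a sketch of the idea only, not vLLM's implementation; `BlockAllocator` and its methods are hypothetical, and it tracks block ids without storing any tensors.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator (hypothetical, for illustration)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; take a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:   # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=4)
for _ in range(5):                  # 5 tokens need 2 blocks of size 4
    alloc.append_token("seq-a")
blocks_in_use = len(alloc.block_tables["seq-a"])
alloc.free("seq-a")                 # all 4 blocks are available again
```

A sequence never reserves more than one partially filled block, and freeing it makes its blocks usable by any other sequence regardless of where they sit in memory.
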
03

Improved Hardware Utilization & Throughput

The primary engineering benefit is maximizing the use of expensive GPU resources. By keeping the GPU's streaming multiprocessors (SMs) busy, continuous batching can achieve roughly 2-10x higher throughput than static batching. The actual gain depends on the model, the hardware, and how much request lengths and arrival times vary; workloads with uniform lengths benefit least.

  • Reduces Tail Latency: Prevents short requests from waiting behind long ones in a static batch.
  • Ideal for Chat & Streaming: Perfectly suits interactive applications where requests arrive asynchronously.
  • Cost-Effective Inference: Higher throughput directly translates to lower cost per token, a key metric for CTOs.
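
A back-of-the-envelope comparison shows where the throughput gain comes from. The sketch below counts decoding steps only, assuming every step costs the same regardless of batch occupancy and ignoring prefill; the function names are illustrative.

```python
def static_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled before the next step."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Requests with one token left finish on this step and leave the batch.
        active = [n - 1 for n in active if n > 1]
    return steps


def utilization(lengths, batch_size, steps):
    """Fraction of (step, slot) pairs that produced a useful token."""
    return sum(lengths) / (steps * batch_size)


lengths = [1, 4, 1, 4]   # output lengths in tokens, alternating short and long
s_steps = static_steps(lengths, batch_size=2)        # 8 steps
c_steps = continuous_steps(lengths, batch_size=2)    # 6 steps
```

For this toy workload, slot utilization rises from 10/16 (62.5%) under static batching to 10/12 (about 83%) under continuous batching. The gap widens as output lengths become more uneven.
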
04

First Token vs. Next Token Latency

Continuous batching optimizes two critical latency metrics differently:

  • Time to First Token (TTFT): Often improved because requests can join the running batch on arrival instead of waiting for a large batch to form. This is crucial for user-perceived responsiveness.
  • Time per Output Token (TPOT): Also known as inter-token latency. Keeping the batch full of active sequences raises aggregate throughput, but each decoding step can take slightly longer as the batch grows, and prefill work for newly admitted requests can briefly delay sequences already in flight.

Understanding this trade-off is essential for tuning serving parameters to meet specific application Service Level Objectives (SLOs).
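
Both metrics can be computed from per-token timestamps. The helper below is a minimal sketch; `latency_metrics` and its arguments are assumed names, and TPOT is taken as the mean gap between consecutive output tokens after the first.

```python
def latency_metrics(arrival_time, token_times):
    """Return (TTFT, mean TPOT) in seconds for a single request.

    arrival_time: when the request reached the server.
    token_times: ascending timestamps at which each output token was emitted.
    """
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - arrival_time
    if len(token_times) == 1:
        return ttft, 0.0
    # Average the gaps between consecutive tokens, excluding the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


ttft, tpot = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25])
```

Here the request waits 0.5 s for its first token and then receives one token every 0.25 s on average. Tracking the two numbers separately shows whether a tuning change helped admission, decoding, or neither.
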
05

Implementation in Serving Engines

Continuous batching is a core feature of modern, high-performance LLM serving engines. It is not typically a simple configuration flag but a fundamental architectural choice.

  • vLLM: Implements iteration-level scheduling in its scheduler and pairs it with PagedAttention for KV cache management.
  • Text Generation Inference (TGI): Uses a continuous batching algorithm, a technique also described as "iteration-level batching" or "in-flight batching."
  • NVIDIA Triton Inference Server: Its built-in dynamic batcher groups whole requests. Continuous (in-flight) batching for LLMs is provided by the backend, such as TensorRT-LLM, so efficiency depends on the backend framework.

These engines handle the complex scheduling, memory management, and attention masking required to make continuous batching work correctly.
06

Contrast with Dynamic Batching

It's important to distinguish continuous batching from the more general dynamic batching:

  • Dynamic Batching: Collects requests over a short time window (e.g., 10ms) to form a batch, then processes that entire batch to completion. Requests wait at the start.
  • Continuous Batching: Has no fixed batch boundary. The set of requests being processed evolves with each decoding step. This is true iteration-level scheduling.

Continuous batching is a stricter, more aggressive form of dynamic batching specifically designed for the autoregressive decoding loop of LLMs and is essential for achieving state-of-the-art serving efficiency.
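
For contrast, window-based dynamic batching can be sketched as a grouping step that runs before any computation starts. The function name and the rule used here (close a batch when it is full, or when a request arrives more than `window` after the batch's first request) are simplifying assumptions.

```python
def dynamic_batches(arrival_times, window, max_batch_size):
    """Group ascending arrival times (e.g., in ms) into request-level batches."""
    batches, current = [], []
    for t in arrival_times:
        # Close the open batch if it is full or its window has expired.
        if current and (len(current) == max_batch_size or t - current[0] > window):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


groups = dynamic_batches([0, 2, 9, 11, 25], window=10, max_batch_size=4)
# groups == [[0, 2, 9], [11], [25]]
```

Each resulting group is then processed to completion as a unit, so the request arriving at 11 ms cannot join the first batch even if one of its members finishes early. Continuous batching removes that boundary by re-evaluating admission at every decoding step.
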
INFERENCE BATCHING COMPARISON

Continuous Batching vs. Static & Dynamic Batching

A technical comparison of batching strategies for serving autoregressive language models, focusing on GPU utilization, latency, and throughput.

| Feature / Metric | Static Batching | Dynamic Batching | Continuous Batching (Iterative Batching) |
| --- | --- | --- | --- |
| Batch Formation | Fixed at request arrival. All requests in a batch must finish generation together. | Dynamic grouping based on arrival time and sequence length before processing starts. | Continuous. New requests are added to a running batch as previous requests finish generation. |
| GPU Utilization | Low to moderate. GPU idles during padding and while waiting for the slowest request in the batch. | Moderate. Reduces idle time from padding but still suffers from straggler requests. | High to very high. Maximizes GPU occupancy by continuously feeding it new tokens. |
| Latency Profile (Time to First Token) | High and variable. All requests wait for the batch to fill before any generation starts. | Reduced. Batches form more quickly, but requests still wait for batch formation. | Low and consistent while capacity is available. Requests begin generation on arrival into the running batch. |
| Latency Profile (End-to-End) | High. Dictated by the slowest (longest) request in the batch. | Moderate. Improved over static but still impacted by stragglers. | Optimal. Individual requests finish and exit the batch independently, minimizing tail latency. |
| Throughput (Tokens/sec) | Low. Inefficient use of compute due to padding and idle time. | Moderate. Better than static batching. | High. Can achieve roughly 2-10x improvements over static batching for LLM inference, depending on workload. |
| Handles Variable Sequence Lengths | | | |
| Eliminates Padding Waste | | | |
| Implementation Complexity | Low. Simple to implement. | Moderate. Requires scheduling logic. | High. Requires sophisticated memory management (e.g., PagedAttention) and scheduling. |
| Ideal Use Case | Offline batch processing with uniform sequence lengths. | Online serving with moderate latency requirements. | High-throughput, low-latency online serving of autoregressive models (LLMs). |
| Key Enabling Technology | N/A | Dynamic batch schedulers in inference servers. | PagedAttention (vLLM), Orca-style iteration scheduling, TGI's continuous batching. |

CONTINUOUS BATCHING

Implementations and Frameworks

Continuous batching is implemented in specialized inference servers and frameworks designed to maximize hardware utilization for autoregressive text generation. These systems manage the complex orchestration of variable-length sequences and dynamic resource allocation.

CONTINUOUS BATCHING

Frequently Asked Questions

Continuous batching is a critical inference optimization for autoregressive models like LLMs. These questions address its core mechanisms, benefits, and implementation for production serving.

Continuous batching, also known as iterative batching or in-flight batching, is an inference optimization technique where new requests are dynamically added to a running batch as previous requests finish generation, rather than waiting for an entire batch to complete before starting a new one.

It works by treating the batch as a mutable set of sequences. After each decoding step, in which every active sequence generates its next token, completed sequences are removed from the batch and their results are returned. New incoming requests are then slotted into the freed space in time for the next iteration. The batch composition therefore changes continuously, which keeps GPU utilization high even under variable request loads and sequence lengths. This is a fundamental shift from static batching, which suffers from padding inefficiency and idle time.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.