Continuous batching, also known as iterative batching or in-flight batching, is a server-side optimization where new inference requests are dynamically added to a running batch as previous requests finish generating tokens. Unlike static batching, which waits for an entire batch to complete before processing new requests, this method treats sequences within a batch as independent processes. The server continuously schedules and executes the next token generation step only for the sequences that are still active, allowing finished requests to exit the batch and new ones to join immediately, thereby eliminating GPU idle time.
Continuous Batching

What is Continuous Batching?
Continuous batching is an advanced inference optimization technique for autoregressive models like large language models (LLMs) that dramatically increases GPU utilization and throughput.
This technique is fundamental to high-performance inference servers like vLLM and Text Generation Inference (TGI). It works in tandem with the Key-Value (KV) Cache, where memory for finished sequences is efficiently reclaimed. The primary benefit is a significant increase in throughput (tokens/second) and hardware utilization, especially for workloads with variable sequence lengths and arrival times, making it a cornerstone of cost-effective LLM serving in production environments.
Key Features and Benefits
Continuous batching fundamentally rethinks request processing for autoregressive models. Instead of waiting for a full batch to complete, it dynamically manages a pool of requests, leading to significant performance gains.
Iterative Request Scheduling
Unlike static batching, which waits for all sequences in a batch to finish generation, continuous batching adds new requests to a running batch as previous ones complete. This is often called iteration-level scheduling or incremental batching. The scheduler manages a pool of active requests, and at each model forward pass, it only computes tokens for requests that are still generating, eliminating idle GPU cycles.
- Dynamic Pool: New queries join the active pool immediately.
- Finished Requests: Completed sequences are removed, and their GPU memory is freed for new ones.
- Higher Utilization: GPUs are kept consistently busy, dramatically improving throughput (tokens/second).
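The scheduling loop described above can be sketched as a small simulation. This is a toy model under stated assumptions, not any engine's actual scheduler: the `Request` class, the `run_continuous` function, and the idea that every forward pass produces exactly one token per active sequence are illustrative simplifications.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """A toy request; `remaining` is how many tokens it still has to generate."""
    name: str
    remaining: int


def run_continuous(requests, max_batch_size):
    """Simulate iteration-level scheduling; return {name: step at which it finished}."""
    waiting = deque(requests)   # requests that have arrived but are not yet running
    active = []                 # the running batch
    finished_at = {}
    step = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next forward pass.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        step += 1
        # One "forward pass": every active sequence produces exactly one token.
        for req in active:
            req.remaining -= 1
        # Retire completed sequences so their slots are free at the next step.
        for req in active:
            if req.remaining == 0:
                finished_at[req.name] = step
        active = [req for req in active if req.remaining > 0]
    return finished_at


finish_steps = run_continuous(
    [Request("a", 2), Request("b", 5), Request("c", 3)], max_batch_size=2
)
print(finish_steps)  # {'a': 2, 'b': 5, 'c': 5}
```

Request "c" starts at step 3, as soon as "a" leaves the batch, instead of waiting for "b" to finish.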
PagedAttention & Memory Optimization
A major bottleneck in LLM inference is managing the Key-Value (KV) Cache. Continuous batching is enabled by advanced memory management like PagedAttention (used in vLLM). This technique treats the KV cache like virtual memory:
- Non-Contiguous Blocks: KV cache is stored in fixed-size blocks, not per-sequence contiguous memory.
- Reduces Fragmentation: Avoids reserving memory for each sequence up to its maximum possible length; waste is limited to the partially filled last block of each sequence.
- Efficient Sharing: Allows memory sharing between prompts that share a common prefix in advanced scenarios.
This allows the system to efficiently allocate and deallocate memory for sequences as they start and finish, which is critical for maintaining high batch sizes with variable-length outputs.
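The block-based allocation idea can be illustrated with a minimal sketch. The `BlockAllocator` class and its method names are hypothetical and only track block bookkeeping; they are not vLLM's API, and no tensors are stored.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: hands out fixed-size blocks from a free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token; allocate a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:   # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=4)
for _ in range(5):
    allocator.append_token("seq-0")   # 5 tokens need 2 blocks
for _ in range(3):
    allocator.append_token("seq-1")   # 3 tokens need 1 block
blocks_used_by_seq0 = len(allocator.block_tables["seq-0"])
allocator.free("seq-0")               # both blocks become reusable
free_after_release = len(allocator.free_blocks)
```

Because a sequence's blocks need not be adjacent, any free block can serve any sequence, and the only unused space is in each sequence's last block.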
Improved Hardware Utilization & Throughput
The primary engineering benefit is maximizing the use of expensive GPU resources. By keeping the computational units (SMs) saturated, continuous batching can achieve 2-10x higher throughput compared to static batching, especially for workloads with variable request lengths and arrival times.
- Reduces Tail Latency: Prevents short requests from waiting behind long ones in a static batch.
- Ideal for Chat & Streaming: Perfectly suits interactive applications where requests arrive asynchronously.
- Cost-Effective Inference: Higher throughput directly translates to lower cost per token, a key metric for CTOs.
First Token vs. Next Token Latency
Continuous batching optimizes two critical latency metrics differently:
- Time to First Token (TTFT): Often improved because requests can begin processing immediately upon arrival without waiting to form a large batch. This is crucial for user-perceived responsiveness.
- Time per Output Token (TPOT): Also known as inter-token latency. Because finished sequences are replaced immediately, each decoding step serves a full batch of active sequences, which keeps GPU utilization high throughout generation; the cost is that larger batches and the prefill work for newly admitted requests can lengthen individual steps.
Understanding this trade-off is essential for tuning serving parameters to meet specific application Service Level Objectives (SLOs).
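As a small worked example, both metrics can be computed from per-token timestamps. The function name `latency_metrics` and the timestamps are illustrative assumptions, and TPOT is taken here as the mean gap between consecutive output tokens after the first.

```python
def latency_metrics(arrival_time, token_times):
    """Return (TTFT, mean TPOT) in seconds for one request.

    arrival_time: when the request reached the server.
    token_times:  timestamps at which each output token was emitted, in order.
    """
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - arrival_time
    if len(token_times) == 1:
        return ttft, 0.0
    # Mean gap between consecutive output tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


ttft, tpot = latency_metrics(arrival_time=10.0, token_times=[10.5, 10.75, 11.0, 11.25])
print(ttft, tpot)  # 0.5 0.25
```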
Implementation in Serving Engines
Continuous batching is a core feature of modern, high-performance LLM serving engines. It is not typically a simple configuration flag but a fundamental architectural choice.
- vLLM: Implements iteration-level scheduling in its scheduler, with PagedAttention managing the KV cache memory that makes it practical.
- Text Generation Inference (TGI): Uses a continuous batching algorithm often referred to as "iteration-level batching" or "in-flight batching."
- NVIDIA Triton Inference Server: Supports dynamic batching, which can be configured for continuous behavior with LLMs, though its efficiency depends on the backend framework.
These engines handle the complex scheduling, memory management, and attention masking required to make continuous batching work correctly.
Contrast with Dynamic Batching
It's important to distinguish continuous batching from the more general dynamic batching:
- Dynamic Batching: Collects requests over a short time window (e.g., 10ms) to form a batch, then processes that entire batch to completion. Requests wait at the start.
- Continuous Batching: Has no fixed batch boundary. The set of requests being processed evolves with each decoding step. This is true iteration-level scheduling.
Continuous batching is a stricter, more aggressive form of dynamic batching specifically designed for the autoregressive decoding loop of LLMs and is essential for achieving state-of-the-art serving efficiency.
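The effect of a fixed batch boundary can be seen by counting decoding steps for the same workload under both policies. This is a rough sketch with hypothetical function names and output lengths; it assumes all requests are already queued and that each step yields one token per running sequence.

```python
def run_to_completion_steps(output_lengths, batch_size):
    """Decoding steps when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        # The batch occupies the GPU until its longest sequence is done.
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decoding steps when a freed slot is refilled at the very next iteration."""
    pending = list(output_lengths)
    slots = []          # remaining tokens for each running sequence
    steps = 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))
        steps += 1
        # Each running sequence emits one token; finished ones drop out.
        slots = [n - 1 for n in slots if n - 1 > 0]
    return steps


lengths = [2, 10, 3, 9]   # hypothetical output lengths, all queued at time zero
fixed_boundary_steps = run_to_completion_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(fixed_boundary_steps, continuous_steps)  # 19 14
```

With a fixed boundary the short requests hold their slots idle until the longest request in their batch finishes; with iteration-level scheduling those slots are reused immediately.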
Continuous Batching vs. Static & Dynamic Batching
A technical comparison of batching strategies for serving autoregressive language models, focusing on GPU utilization, latency, and throughput.
| Feature / Metric | Static Batching | Dynamic Batching | Continuous Batching (Iterative Batching) |
|---|---|---|---|
| Batch Formation | Fixed at request arrival. All requests in a batch must finish generation together. | Dynamic grouping based on arrival time and sequence length before processing starts. | Continuous. New requests are added to a running batch as previous requests finish generation. |
| GPU Utilization | Low to moderate. GPU idles during padding and while waiting for the slowest request in the batch. | Moderate. Reduces idle time from padding but still suffers from straggler requests. | High to very high. Maximizes GPU occupancy by continuously feeding it new tokens. |
| Latency Profile (Time to First Token) | High and variable. All requests wait for the batch to fill before any generation starts. | Reduced. Batches form more quickly, but requests still wait for batch formation. | Low and consistent. Requests begin generation immediately upon arrival into the running batch. |
| Latency Profile (End-to-End) | High. Dictated by the slowest (longest) request in the batch. | Moderate. Improved over static but still impacted by stragglers. | Optimal. Individual requests finish and exit the batch independently, minimizing tail latency. |
| Throughput (Tokens/sec) | Low. Inefficient use of compute due to padding and idle time. | Moderate. Better than static batching. | High. Can achieve 5x-10x improvements over static batching for LLM inference. |
| Handles Variable Sequence Lengths | | | |
| Eliminates Padding Waste | | | |
| Implementation Complexity | Low. Simple to implement. | Moderate. Requires scheduling logic. | High. Requires sophisticated memory management (e.g., PagedAttention) and scheduling. |
| Ideal Use Case | Offline batch processing with uniform sequence lengths. | Online serving with moderate latency requirements. | High-throughput, low-latency online serving of autoregressive models (LLMs). |
| Key Enabling Technology | N/A | Dynamic batch schedulers in inference servers. | PagedAttention (vLLM), Orca-style iteration scheduling, TGI's continuous batching. |
Implementations and Frameworks
Continuous batching is implemented in specialized inference servers and frameworks designed to maximize hardware utilization for autoregressive text generation. These systems manage the complex orchestration of variable-length sequences and dynamic resource allocation.
Frequently Asked Questions
Continuous batching is a critical inference optimization for autoregressive models like LLMs. These questions address its core mechanisms, benefits, and implementation for production serving.
Continuous batching, also known as iterative batching or in-flight batching, is an inference optimization technique where new requests are dynamically added to a running batch as previous requests finish generation, rather than waiting for an entire batch to complete before starting a new one. It works by treating the batch as a mutable set of sequences. After each decoding step, in which every sequence in the batch generates its next token, completed sequences are removed from the batch and their results are returned. New incoming requests are then slotted into the freed space before the next iteration. The batch composition therefore changes continuously, which keeps GPU utilization consistently high even under variable request loads and sequence lengths. This is a fundamental shift from static batching, which suffers from padding inefficiency and idle time.
Related Terms
Continuous batching is a core component of modern, high-throughput inference serving. These related concepts define the surrounding architecture and complementary optimization techniques.
Dynamic Batching
The foundational technique where an inference server groups multiple incoming requests into a single batch for parallel GPU processing. Unlike static batching, it forms batches dynamically based on arrival time and sequence length to improve hardware utilization and reduce latency.
- Key Mechanism: A scheduler collects requests in a queue for a short window before sending them to the model as a batch.
- Trade-off: Balances increased throughput against the latency of waiting for the batch to fill.
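The collection window can be sketched as follows. This is a simplified illustration, not the batching logic of any particular server: `form_batches` is a hypothetical helper that groups sorted arrival times (in milliseconds) and closes a batch when it is full or when a request arrives after the window has expired.

```python
def form_batches(arrival_times_ms, window_ms, max_batch_size):
    """Group sorted arrival times into batches using a fixed collection window."""
    batches = []
    current = []
    for t in arrival_times_ms:
        # Close the current batch if it is full or its window has expired.
        if current and (len(current) == max_batch_size or t - current[0] > window_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


batches = form_batches([0, 3, 8, 25, 27, 60], window_ms=10, max_batch_size=4)
print(batches)  # [[0, 3, 8], [25, 27], [60]]
```

A longer window yields larger batches and higher throughput, but the first request in each batch waits longer before processing starts.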
Key-Value (KV) Cache
A critical memory optimization for autoregressive transformer inference. It stores the computed key and value tensors for all previously generated tokens in a sequence, preventing their recalculation at each generation step.
- Impact on Batching: The KV cache is a primary consumer of GPU memory during batched inference. Its efficient management is essential for continuous batching.
- PagedAttention: Advanced systems like vLLM use this technique to manage the KV cache in non-contiguous memory blocks, drastically reducing fragmentation and allowing larger batch sizes.
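To see why the KV cache dominates memory, its size can be estimated from the model shape: each token stores one key and one value vector per layer. The model dimensions and the 20 GiB budget below are hypothetical numbers chosen for illustration, not measurements of a specific model or GPU.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies (default 2 bytes per value, e.g. fp16)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_memory_bytes, bytes_per_token):
    """How many tokens fit in the memory left over for the KV cache."""
    return free_memory_bytes // bytes_per_token


# Hypothetical shape: 32 layers, 32 KV heads, head dimension 128, fp16 values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
capacity = max_cached_tokens(20 * 1024**3, per_token)   # assume 20 GiB is available
print(per_token, capacity)  # 524288 40960
```

Under these assumptions each token costs 0.5 MiB, so the token budget shared by all sequences in the batch is what ultimately caps the batch size.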
Inference Latency vs. Throughput
The fundamental trade-off optimized by batching strategies. Throughput measures the work completed per second (requests, or for LLMs, output tokens), while latency measures the time to complete a single request.
- Static Batching: Maximizes throughput for fixed workloads but harms tail latency (the slowest requests).
- Continuous Batching: Aims to improve both metrics by keeping the GPU saturated (high throughput) while allowing new requests to join immediately (better latency).
- Engineering Goal: The design of systems like vLLM and TGI is to push the Pareto frontier, achieving higher throughput without proportional latency increases.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.