PagedAttention is an algorithm, introduced by the vLLM inference engine, that manages the Key-Value (KV) cache of transformer attention mechanisms using virtual memory paging concepts. It treats the KV cache as non-contiguous blocks, or 'pages,' in GPU memory. This eliminates the massive memory waste and fragmentation caused by pre-allocating large, fixed-size contiguous blocks for variable-length sequences, a common bottleneck in high-throughput LLM serving.
PagedAttention

What is PagedAttention?
PagedAttention is a memory management algorithm for transformer inference that adapts operating system virtual memory concepts to efficiently handle the variable-length Key-Value (KV) cache.
By enabling dynamic allocation and sharing of these memory pages across different sequences in a batch, PagedAttention drastically increases GPU memory utilization. This allows serving systems to support significantly higher concurrent request loads without out-of-memory errors, directly improving throughput while maintaining low latency. The algorithm is foundational to vLLM's performance and is a key technique in modern inference optimization stacks.
Key Features and Benefits
PagedAttention is a memory management algorithm for the Key-Value (KV) cache in transformer-based language models. It applies virtual memory paging concepts to eliminate fragmentation and waste, enabling high-throughput, low-latency serving of variable-length sequences.
Eliminates KV Cache Fragmentation
Traditional KV cache allocation reserves a contiguous memory region for each request's maximum possible sequence length, leading to significant internal fragmentation because most sequences are shorter than the maximum. PagedAttention divides the KV cache into fixed-size blocks (analogous to memory pages). These blocks are allocated non-contiguously and only as the sequence actually grows, so waste is confined to the unfilled slots of each sequence's last block. This allows more concurrent requests to fit into GPU VRAM, directly increasing throughput.
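The difference can be made concrete with a little arithmetic. The sketch below compares the token slots reserved under per-request maximum-length pre-allocation with those reserved under block-granular allocation. The sequence lengths, the 2048-token maximum, and the 16-token block size are assumptions chosen only for illustration.

```python
import math

def contiguous_slots(seq_lens, max_len):
    """Slots reserved when every request pre-allocates max_len token slots."""
    return len(seq_lens) * max_len

def paged_slots(seq_lens, block_size):
    """Slots reserved when each request holds ceil(len / block_size) blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [37, 212, 5, 1024, 90]   # hypothetical actual sequence lengths
max_len = 2048                      # hypothetical per-request maximum
block_size = 16                     # tokens per block (assumed)

used = sum(seq_lens)
static = contiguous_slots(seq_lens, max_len)
paged = paged_slots(seq_lens, block_size)

print(f"tokens actually stored: {used}")
print(f"static reservation:     {static} slots ({1 - used / static:.1%} unused)")
print(f"paged reservation:      {paged} slots ({1 - used / paged:.1%} unused)")
```

With paging, the unused slots per sequence can never exceed one block minus one token, regardless of how long the configured maximum is.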
Enables Efficient Continuous Batching
PagedAttention's block-based management is the foundation for high-efficiency continuous batching (also known as in-flight batching). Because sequences are composed of independent blocks, the scheduler can:
- Dynamically add new requests to a running batch.
- Free the blocks of completed sequences immediately, and evict the blocks of preempted or paused sequences to CPU RAM.
- Swap blocks back into GPU memory when a paused sequence (e.g., in turn-based chatbot dialogue) needs to resume generation.
This maximizes GPU utilization by keeping the computational pipeline saturated, even with highly variable request arrival times and sequence lengths.
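Swapping a sequence between GPU and CPU memory can be sketched as moving its blocks between two free lists while keeping the logical order of its block table. This is a toy model under stated assumptions: `TwoTierPool` and its method names are hypothetical, and the actual device-to-host copy of KV data is only marked by a comment.

```python
class TwoTierPool:
    """Toy allocator with a GPU pool and a CPU pool of block ids (hypothetical)."""

    def __init__(self, num_gpu_blocks, num_cpu_blocks):
        self.free = {"gpu": list(range(num_gpu_blocks)),
                     "cpu": list(range(num_cpu_blocks))}

    def move(self, block_table, src, dst):
        """Move every block of a sequence from src to dst; return the new table."""
        if any(device != src for device, _ in block_table):
            raise ValueError(f"sequence is not resident on {src}")
        if len(self.free[dst]) < len(block_table):
            raise MemoryError(f"not enough free {dst} blocks")
        new_table = []
        for _, block_id in block_table:
            new_id = self.free[dst].pop()
            # A real engine would copy the block's KV data between devices here.
            self.free[src].append(block_id)
            new_table.append((dst, new_id))
        return new_table

pool = TwoTierPool(num_gpu_blocks=4, num_cpu_blocks=8)
# A paused sequence currently holding two GPU blocks.
table = [("gpu", pool.free["gpu"].pop()), ("gpu", pool.free["gpu"].pop())]

table = pool.move(table, "gpu", "cpu")   # swap out: its GPU blocks become free
gpu_free_while_paused = len(pool.free["gpu"])
table = pool.move(table, "cpu", "gpu")   # swap in: sequence is resident again
```

Because the table is rebuilt entry by entry, the sequence may come back in different physical blocks than it left, which is harmless since attention only ever goes through the table.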
Supports Advanced Memory Sharing
The algorithm enables shared memory blocks across different sequences or within a single sequence. This is critical for two optimizations:
- Shared Prompt Prefixes: In multi-user scenarios where requests share a common system prompt or context, PagedAttention allows a single physical block of KV cache for the shared prefix to be referenced by all relevant sequences, providing dramatic memory savings.
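- Parallel Sampling: When generating multiple output candidates from the same input (e.g., beam search or drawing several samples for one prompt), the shared context's KV cache is not duplicated. A block is copied only when a candidate diverges and must write to a block that others still reference (copy-on-write). This reduces memory pressure during advanced decoding strategies.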
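Both forms of sharing can be modeled with per-block reference counts plus copy-on-write. The sketch below is a minimal illustration, not vLLM's implementation: `SharedBlockPool`, `fork`, and `writable` are hypothetical names, and the copy of KV data is only marked by a comment.

```python
class SharedBlockPool:
    """Toy pool of physical blocks with reference counts (illustrative only)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        """Create a second sequence that shares every block of the first."""
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)

    def writable(self, block_table, index):
        """Return a block safe to write at table[index], copying if it is shared."""
        block = block_table[index]
        if self.ref_count[block] == 1:
            return block
        # Copy-on-write: give this sequence a private copy of the shared block.
        self.ref_count[block] -= 1
        new_block = self.allocate()
        # A real engine would copy the block's KV data here.
        block_table[index] = new_block
        return new_block

pool = SharedBlockPool(num_blocks=8)
parent = [pool.allocate(), pool.allocate()]     # prompt occupies two blocks
child = pool.fork(parent)                       # second sample shares both

blocks_in_use_after_fork = len(pool.ref_count)  # still 2 physical blocks
pool.writable(child, 1)                         # child writes to its last block
```

After the write, the two sequences still share the first block and differ only in the last one, so the cost of an extra candidate is one block rather than a full copy of the prompt's cache.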
Reduces Out-of-Memory Errors & Enables Longer Contexts
By packing KV cache blocks densely and sharing memory, PagedAttention drastically reduces the incidence of out-of-memory (OOM) errors during high-concurrency serving. The efficient memory use also makes it more practical to serve models with very long context windows (e.g., 128K or 1M tokens): the KV cache of a sequence still grows linearly with its length, but memory is committed only for tokens actually present rather than reserved up front for the maximum length. This allows practical deployment of models capable of processing large documents, lengthy conversations, or complex codebases.
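A back-of-envelope estimate shows why long contexts put so much pressure on the KV cache. For standard attention, each token stores one key and one value vector per layer and per KV head. The model shape below (32 layers, 32 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration chosen only for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

GIB = 1024 ** 3
for context_len in (4_096, 131_072):
    total = per_token * context_len
    print(f"{context_len:>7} tokens -> {total / GIB:.1f} GiB of KV cache")
```

Paging does not shrink these per-token numbers. What it changes is how much is reserved beyond them, which matters most when the configured maximum length is far larger than the typical request.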
Decouples Logical & Physical Layout
PagedAttention introduces a level of indirection between the logical sequence of tokens and their physical storage in memory. A block table (similar to a page table) maps the logical positions of tokens to their actual physical block addresses. This decoupling is what enables all other features:
- Non-contiguous allocation prevents fragmentation.
- Easy sharing of physical blocks across logical sequences.
- Efficient swapping of blocks between GPU and CPU memory.
This design mirrors operating system principles, applying proven systems techniques to the specific problem of attention cache management.
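The indirection itself is small. Translating a logical token position to a physical location takes one division and one table lookup, as in the sketch below. The function name `locate`, the block size of 16, and the table contents are assumptions for illustration.

```python
def locate(block_table, token_pos, block_size=16):
    """Translate a logical token position into (physical_block, offset)."""
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset

# Logical blocks 0, 1, 2 of one sequence live in scattered physical blocks.
block_table = [7, 2, 11]

print(locate(block_table, 0))    # (7, 0)
print(locate(block_table, 20))   # (2, 4)
print(locate(block_table, 47))   # (11, 15)
```

An attention kernel built on this scheme performs the same translation when it gathers keys and values, which is why physical blocks never need to be adjacent.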
PagedAttention vs. Traditional KV Cache Management
A technical comparison of memory management strategies for the Key-Value (KV) cache during autoregressive LLM inference, focusing on efficiency, fragmentation, and throughput.
| Feature / Metric | Traditional KV Cache (Static Allocation) | PagedAttention (Dynamic Paging) |
|---|---|---|
| Core Management Paradigm | Static, contiguous pre-allocation per sequence | Dynamic, non-contiguous paging using virtual memory concepts |
| Memory Allocation Unit | Entire sequence length (max_tokens) | Fixed-size blocks (e.g., 16 tokens) |
| Memory Fragmentation | High (internal & external) | Near-zero |
| Memory Waste for Variable-Length Sequences | Significant (unused pre-allocated slots) | Minimal (allocates only used blocks) |
| Support for Advanced Sampling (e.g., beam search) | Inefficient, requires separate cache copies | Efficient, shares physical blocks via virtual mapping |
| Effect on System Throughput (QPS) | Lower due to memory bottleneck | Higher (vLLM's launch benchmarks report up to 24x over Hugging Face Transformers) |
| Implementation Complexity | Lower (simple tensor allocation) | Higher (requires block table & allocator) |
| Optimal Use Case | Uniform, predictable sequence lengths | Highly variable, unpredictable sequence lengths (e.g., chat, long document processing) |
Frequently Asked Questions
PagedAttention is a foundational algorithm for optimizing large language model inference. These questions address its core mechanisms, benefits, and role in modern serving systems.
PagedAttention is an algorithm that manages the Key-Value (KV) cache in transformer-based language models using concepts borrowed from virtual memory paging in operating systems. During autoregressive decoding, each new token attends to the key and value vectors of all previous tokens in the sequence, so those vectors are kept in GPU memory as the KV cache. PagedAttention divides this cache into fixed-size blocks (pages). Instead of allocating one contiguous region sized for each request's maximum length, which wastes space and fragments memory, it allocates non-contiguous blocks on demand. A block table, maintained by a central block manager, maps the logical sequence of tokens to their physical memory pages, allowing for efficient sharing of blocks between sequences (e.g., in a shared prompt) and compact storage of variable-length sequences. This confines internal fragmentation to the last, partially filled block of each sequence and allows previously wasted memory to be used for additional concurrent requests, dramatically increasing throughput.
Key Mechanism: When a request needs to store KV vectors for a new token, the memory allocator assigns it to the next available slot in the current block. If the block is full, a new block is allocated. The block table tracks this mapping, enabling non-contiguous storage while preserving the logical order for the attention computation.
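The append path described above can be sketched in a few lines. This is a toy model under stated assumptions: the `Sequence` class, the shared `free_blocks` list, and the block size of 4 are hypothetical and stand in for an engine's real allocator.

```python
class Sequence:
    """Toy per-sequence state: a block table plus a token count (illustrative)."""

    def __init__(self, free_blocks, block_size=4):
        self.free_blocks = free_blocks   # shared list of free physical block ids
        self.block_size = block_size
        self.block_table = []            # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Reserve the slot for one new token; return (physical_block, offset)."""
        offset = self.num_tokens % self.block_size
        if offset == 0:
            # Current block is full (or none exists yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            self.block_table.append(self.free_blocks.pop())
        self.num_tokens += 1
        return self.block_table[-1], offset

free_blocks = [9, 3, 5]
seq = Sequence(free_blocks)
slots = [seq.append_token() for _ in range(6)]
print(slots)             # six (block, offset) pairs spanning two blocks
print(seq.block_table)   # physical blocks in logical order
```

Six tokens with a block size of 4 occupy two physical blocks that are not adjacent, while the block table keeps them in logical order for the attention computation.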
Related Terms
PagedAttention is a core component of modern high-performance LLM serving. These related concepts define the broader ecosystem of techniques and metrics for optimizing inference latency and throughput.
Continuous Batching
Also known as in-flight batching or iteration-level scheduling, this is an inference optimization technique where new requests are added to a running batch on the GPU as soon as previous requests finish generation. This maximizes GPU utilization and throughput by eliminating idle time, in contrast with static batching, where the entire batch must finish before a new one starts. It is a foundational technique used in conjunction with PagedAttention in engines like vLLM.
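The contrast with static batching can be illustrated by counting decode steps. The sketch below is a simplification that ignores prefill cost and memory limits and treats the batch as a fixed number of slots. The output lengths and the batch size of 2 are assumptions for illustration, and every request is assumed to generate at least one token.

```python
def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = list(output_lens)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1                                    # one decode step for all running
        running = [n - 1 for n in running if n > 1]   # drop requests that just finished
    return steps

output_lens = [2, 8, 3, 8, 2, 3]   # hypothetical tokens to generate per request
static = static_batching_steps(output_lens, batch_size=2)
continuous = continuous_batching_steps(output_lens, batch_size=2)
print(f"static: {static} steps, continuous: {continuous} steps")
```

In this toy workload every slot stays busy under continuous batching, so the step count equals the total token count divided by the batch size, while static batching idles slots whenever a short request is paired with a long one.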
Key-Value (KV) Cache
A memory structure used during the autoregressive decoding of Transformer models. It stores the computed keys and values for all previously processed tokens (prompt and generated), allowing each new token to attend to them without recomputing these tensors at every step. The KV cache is the primary source of memory fragmentation and waste that PagedAttention was designed to solve, as its size grows dynamically with sequence length.
Memory Fragmentation
A condition where free memory is broken into small, non-contiguous blocks, preventing the allocation of larger contiguous blocks even if total free memory is sufficient. In LLM inference, this occurs because variable-length sequences cause the KV cache for each request to be allocated and freed at different times. PagedAttention eliminates this by using a virtual memory paging approach, allowing non-contiguous physical memory blocks to serve a logically contiguous cache.
Speculative Decoding
An inference acceleration technique that reduces the number of slow, sequential autoregressive steps from a large target model. A small, fast draft model (or a simpler heuristic) proposes a short sequence of candidate tokens. The large target model then verifies this sequence in a single, parallel forward pass, accepting correct tokens and rejecting incorrect ones. This technique improves Time Per Output Token (TPOT) and is often used alongside memory optimizations like PagedAttention.
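The propose-then-verify loop can be sketched for the greedy case, where a proposed token is accepted only if it equals the target model's own greedy choice. This is a simplified variant (sampling-based versions use a probabilistic acceptance rule instead), and the two toy "models" are hypothetical functions over integer tokens.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding; returns the tokens accepted."""
    # 1. Draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. Target model checks every proposed position. A real engine does this
    #    in one batched forward pass; the loop here is only for clarity.
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # replace the first mismatch, then stop
            return accepted
        accepted.append(token)
    # 3. All k proposals matched: the same pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy deterministic "models" over integer tokens (hypothetical).
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    # Agrees with the target except right after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10

out = speculative_step([0], draft_next, target_next, k=4)
print(out)   # [1, 2, 3, 4]
```

The output is identical to what the target model would have produced on its own with greedy decoding. The gain is that several tokens are confirmed per target pass whenever the draft model guesses correctly.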
Operator Fusion
A compiler-level inference optimization that combines multiple sequential neural network operations into a single GPU kernel. For example, a pattern like Linear -> Bias Add -> Activation can be fused. This reduces GPU kernel launch overhead and intermediate memory reads/writes, lowering latency. While PagedAttention optimizes memory management, operator fusion (used in engines like TensorRT) optimizes the computation graph itself.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.