PagedAttention

PagedAttention is an algorithm that manages the Key-Value (KV) cache in transformer models using virtual memory paging concepts to eliminate memory fragmentation and waste.
INFERENCE OPTIMIZATION

What is PagedAttention?

PagedAttention is a memory management algorithm for transformer inference that adapts operating system virtual memory concepts to efficiently handle the variable-length Key-Value (KV) cache.

PagedAttention is an algorithm, introduced with the vLLM inference engine, that manages the Key-Value (KV) cache of transformer attention mechanisms using virtual memory paging concepts. It stores the KV cache in fixed-size blocks, or "pages," that need not be contiguous in GPU memory. This removes most of the memory waste and fragmentation that arise when each sequence gets one large contiguous region sized for the maximum length, even though actual sequence lengths vary. That waste is a common bottleneck in high-throughput LLM serving.

By enabling dynamic allocation and sharing of these memory pages across different sequences in a batch, PagedAttention drastically increases GPU memory utilization. This allows serving systems to support significantly higher concurrent request loads without out-of-memory errors, directly improving throughput while maintaining low latency. The algorithm is foundational to vLLM's performance and is a key technique in modern inference optimization stacks.

PAGEDATTENTION

Key Features and Benefits

PagedAttention is a memory management algorithm for the Key-Value (KV) cache in transformer-based language models. It applies virtual memory paging concepts to eliminate fragmentation and waste, enabling high-throughput, low-latency serving of variable-length sequences.

01

Eliminates KV Cache Fragmentation

Traditional KV cache allocation reserves one contiguous region per request, sized for the maximum possible sequence length. Because most sequences are shorter than the maximum, much of that reservation is never used (internal fragmentation). PagedAttention divides the KV cache into fixed-size blocks (analogous to memory pages) and allocates them non-contiguously, only as the sequence actually grows. The only remaining waste is the unfilled tail of each sequence's last block. More concurrent requests therefore fit into GPU VRAM, which directly increases throughput.
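
The size of the saving can be checked with simple arithmetic. The sketch below is illustrative only: the workload (four requests, a 2,048-token limit, 16-token blocks) and the function names are assumptions chosen for the example, not measurements from any serving system.

```python
import math


def static_slots(max_seq_len: int, num_requests: int) -> int:
    """Token slots reserved when every request pre-allocates max_seq_len."""
    return max_seq_len * num_requests


def paged_slots(actual_lens: list[int], block_size: int) -> int:
    """Token slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in actual_lens)


# Hypothetical workload: actual sequence lengths of four concurrent requests.
lens = [37, 512, 120, 900]
used = sum(lens)

reserved_static = static_slots(2048, len(lens))
reserved_paged = paged_slots(lens, 16)

print(f"tokens in use:    {used}")
print(f"static reserved:  {reserved_static} ({used / reserved_static:.1%} utilized)")
print(f"paged reserved:   {reserved_paged} ({used / reserved_paged:.1%} utilized)")
```

Under these assumptions the paged layout wastes only the partially filled last block of each sequence, while static allocation leaves most of its reservation idle.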

02

Enables Efficient Continuous Batching

PagedAttention's block-based management is the foundation for high-efficiency continuous batching (also known as in-flight batching). Because sequences are composed of independent blocks, the scheduler can:

  • Dynamically add new requests to a running batch.
  • Free the blocks of completed sequences immediately so that waiting requests can reuse them.
  • Swap the blocks of a preempted sequence out to CPU RAM when GPU memory runs short, and swap them back in when the sequence resumes generation.

This maximizes GPU utilization by keeping the computational pipeline saturated, even with highly variable request arrival times and sequence lengths.
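
The scheduler behavior in this list can be sketched with a toy block pool. This is a minimal illustration, not vLLM's scheduler: the class `BlockPool` and its methods are hypothetical, and a real system would copy tensor data between devices where this sketch only does bookkeeping.

```python
class BlockPool:
    """Toy pool of GPU KV blocks with swap-out to a CPU-side store."""

    def __init__(self, num_gpu_blocks: int):
        self.free = list(range(num_gpu_blocks))  # free physical block ids
        self.gpu = {}  # seq_id -> list of physical block ids on the GPU
        self.cpu = {}  # seq_id -> number of blocks parked in CPU RAM

    def admit(self, seq_id: str, num_blocks: int) -> bool:
        """Add a request to the running batch if enough blocks are free."""
        if num_blocks > len(self.free):
            return False
        self.gpu[seq_id] = [self.free.pop() for _ in range(num_blocks)]
        return True

    def finish(self, seq_id: str) -> None:
        """Completed sequence: return its blocks to the free list at once."""
        self.free.extend(self.gpu.pop(seq_id))

    def swap_out(self, seq_id: str) -> None:
        """Preempted sequence: park its blocks on the CPU, free the GPU ones."""
        blocks = self.gpu.pop(seq_id)
        self.cpu[seq_id] = len(blocks)  # a real system copies the tensors here
        self.free.extend(blocks)

    def swap_in(self, seq_id: str) -> bool:
        """Resume a preempted sequence once enough GPU blocks are free."""
        if not self.admit(seq_id, self.cpu[seq_id]):
            return False
        del self.cpu[seq_id]
        return True


pool = BlockPool(num_gpu_blocks=8)
pool.admit("a", 5)
pool.admit("b", 3)
blocked = not pool.admit("c", 2)  # pool is full, "c" has to wait
pool.swap_out("b")                # preempt "b" to make room
admitted = pool.admit("c", 2)     # "c" joins the running batch
pool.finish("a")
resumed = pool.swap_in("b")       # "b" resumes once blocks are free again
```

Because every block is the same size, any freed block can serve any sequence, which is what lets the batch change from one step to the next without compaction.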
03

Supports Advanced Memory Sharing

The algorithm lets several sequences reference the same physical blocks, whether they belong to different requests or to multiple candidates of a single request. This is critical for two optimizations:

  • Shared Prompt Prefixes: In multi-user scenarios where requests share a common system prompt or context, PagedAttention allows a single physical copy of the shared prefix's KV cache blocks to be referenced by all relevant sequences, so the prefix is stored once instead of once per request.
  • Parallel Sampling: When generating multiple output candidates from the same input (e.g., beam search or drawing several samples per prompt), the shared context's KV cache is not duplicated. A shared block is copied only when one candidate needs to write to it (copy-on-write). This reduces memory pressure during advanced decoding strategies.
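
Block sharing of this kind is commonly implemented with reference counts plus copy-on-write. The sketch below assumes that approach; the class `SharedBlocks` and its methods are hypothetical names for illustration and only track bookkeeping, not tensor contents.

```python
class SharedBlocks:
    """Toy reference-counted block manager with copy-on-write."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refs = {}  # physical block id -> number of sequences using it

    def alloc(self) -> int:
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def fork(self, table: list[int]) -> list[int]:
        """Create a new sequence that shares every block of its parent."""
        for block in table:
            self.refs[block] += 1
        return list(table)

    def writable_last(self, table: list[int]) -> int:
        """Return a last block that is safe to write, copying if it is shared."""
        last = table[-1]
        if self.refs[last] > 1:
            self.refs[last] -= 1
            table[-1] = self.alloc()  # a real system also copies the contents
        return table[-1]


mgr = SharedBlocks(num_blocks=8)
parent = [mgr.alloc(), mgr.alloc()]  # prompt occupies two blocks
child = mgr.fork(parent)             # second candidate shares both blocks

mgr.writable_last(child)             # child diverges: its last block is copied
mgr.writable_last(parent)            # parent is now sole owner: no copy needed
```

After the fork and the first write, the two sequences still share the first block and hold three physical blocks between them instead of four.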
04

Reduces Out-of-Memory Errors & Enables Longer Contexts

By packing KV cache blocks densely and sharing memory, PagedAttention substantially reduces out-of-memory (OOM) errors during high-concurrency serving. It also makes models with very long context windows (e.g., 128K or 1M tokens) more practical to serve: memory is consumed in proportion to the tokens a request has actually produced rather than reserved up front for the full window. The KV cache for tokens that are present still has to fit in memory, so a request that really fills a long window remains expensive. This allows practical deployment of models capable of processing large documents, lengthy conversations, or complex codebases.
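
A back-of-the-envelope calculation shows why up-front reservation for long windows is costly. The formula is the standard KV cache size (one key and one value vector per layer and KV head); the model shape used here (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an assumed example, not a figure from this article.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed model shape for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

context_tokens = 128 * 1024
full_window_gib = per_token * context_tokens / 2**30

print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{full_window_gib:.0f} GiB to reserve a full {context_tokens}-token window")
```

With these assumed numbers, reserving the whole window for every request is clearly impractical, whereas block-wise allocation charges each request only for the tokens it currently holds.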

05

Decouples Logical & Physical Layout

PagedAttention introduces a level of indirection between the logical sequence of tokens and their physical storage in memory. A block table (similar to a page table) maps the logical positions of tokens to their actual physical block addresses. This decoupling is what enables all other features:

  • Non-contiguous allocation prevents fragmentation.
  • Easy sharing of physical blocks across logical sequences.
  • Efficient swapping of blocks between GPU and CPU memory.

This design mirrors operating system principles, applying proven systems techniques to the specific problem of attention cache management.
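
The indirection can be sketched as an address translation followed by a gather. This is a simplified model using plain Python lists: `BLOCK_SIZE`, `locate`, `gather`, and the sample block ids are assumptions for illustration, and real kernels perform the lookup on the GPU.

```python
BLOCK_SIZE = 4  # tokens per block (small value chosen for readability)


def locate(block_table: list[int], token_idx: int) -> tuple[int, int]:
    """Translate a logical token position to (physical block id, offset)."""
    logical_block, offset = divmod(token_idx, BLOCK_SIZE)
    return block_table[logical_block], offset


def gather(storage: dict, block_table: list[int], seq_len: int) -> list:
    """Read a sequence's cached entries back in logical order."""
    return [storage[block][offset]
            for block, offset in (locate(block_table, i) for i in range(seq_len))]


# Physical blocks live at arbitrary ids; the table restores logical order.
storage = {
    7: ["k0", "k1", "k2", "k3"],
    2: ["k4", "k5", "k6", "k7"],
    5: ["k8", "k9", None, None],  # last block is only partly filled
}
block_table = [7, 2, 5]

position = locate(block_table, 6)
keys = gather(storage, block_table, seq_len=10)
```

The attention computation only ever sees logical positions, so the physical blocks can be scattered, shared, or moved without changing its result.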
KV CACHE OPTIMIZATION

PagedAttention vs. Traditional KV Cache Management

A technical comparison of memory management strategies for the Key-Value (KV) cache during autoregressive LLM inference, focusing on efficiency, fragmentation, and throughput.

| Feature / Metric | Traditional KV Cache (Static Allocation) | PagedAttention (Dynamic Paging) |
| --- | --- | --- |
| Core Management Paradigm | Static, contiguous pre-allocation per sequence | Dynamic, non-contiguous paging using virtual memory concepts |
| Memory Allocation Unit | Entire sequence length (max_tokens) | Fixed-size blocks (e.g., 16 tokens) |
| Memory Fragmentation | High (internal & external) | Near-zero (limited to each sequence's last, partly filled block) |
| Memory Waste for Variable-Length Sequences | Significant (unused pre-allocated slots) | Minimal (allocates only used blocks) |
| Support for Advanced Sampling (e.g., beam search) | Inefficient, requires separate cache copies | Efficient, shares physical blocks via virtual mapping |
| Effect on System Throughput (QPS) | Lower due to memory bottleneck | Higher (vLLM reports up to 24x over Hugging Face Transformers) |
| Implementation Complexity | Lower (simple tensor allocation) | Higher (requires block table & allocator) |
| Optimal Use Case | Uniform, predictable sequence lengths | Highly variable, unpredictable sequence lengths (e.g., chat, long document processing) |

PAGEDATTENTION

Frequently Asked Questions

PagedAttention is a foundational algorithm for optimizing large language model inference. These questions address its core mechanisms, benefits, and role in modern serving systems.

PagedAttention is an algorithm that manages the Key-Value (KV) cache in transformer-based language models using concepts borrowed from virtual memory paging in operating systems. During autoregressive decoding, each new token attends to the key and value vectors of all previous tokens in the sequence, so those vectors are kept in GPU memory; this store is the KV cache. PagedAttention divides the cache into fixed-size blocks (pages). Allocating one contiguous region for each request's variable-length sequence causes memory fragmentation, so PagedAttention instead allocates non-contiguous blocks on demand. A block table maps the logical sequence of tokens to their physical memory pages, allowing for efficient sharing of blocks between sequences (e.g., in a shared prompt) and compact storage of variable-length sequences. This confines internal fragmentation to the last block of each sequence and allows previously wasted memory to be used for additional concurrent requests, which substantially increases throughput.

Key Mechanism: When a request needs to store KV vectors for a new token, the memory allocator assigns it to the next available slot in the current block. If the block is full, a new block is allocated. The block table tracks this mapping, enabling non-contiguous storage while preserving the logical order for the attention computation.
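
The key mechanism can be sketched in a few lines. This is a minimal model under stated assumptions: the class `Sequence`, the method `append_token`, and the sample block ids are hypothetical, and the free list stands in for a real allocator.

```python
class Sequence:
    """Toy sequence that grows its block table one token at a time."""

    def __init__(self, free_blocks: list[int], block_size: int = 16):
        self.free_blocks = free_blocks  # shared list of free physical block ids
        self.block_size = block_size
        self.block_table = []           # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve the slot for one new token; return (physical block, offset)."""
        offset = self.num_tokens % self.block_size
        if offset == 0:
            # First token, or the current block is full: take a new block.
            # Raises IndexError if no free block is left (no preemption here).
            self.block_table.append(self.free_blocks.pop(0))
        self.num_tokens += 1
        return self.block_table[-1], offset


free = [9, 3, 12]  # free physical blocks, deliberately out of order
seq = Sequence(free, block_size=4)
slots = [seq.append_token() for _ in range(6)]
```

Six tokens with a block size of 4 fill physical block 9 and then spill into block 3, while block 12 stays free for other sequences.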

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.