Glossary

PagedAttention

PagedAttention is an algorithm that manages the Key-Value (KV) cache in transformer models using virtual memory paging concepts to eliminate memory fragmentation and waste.

INFERENCE OPTIMIZATION

What is PagedAttention?

PagedAttention is a memory management algorithm for transformer inference that adapts operating system virtual memory concepts to efficiently handle the variable-length Key-Value (KV) cache.

PagedAttention is an algorithm, introduced by the vLLM inference engine, that manages the Key-Value (KV) cache of transformer attention mechanisms using virtual memory paging concepts. It treats the KV cache as non-contiguous blocks, or 'pages,' in GPU memory. This eliminates the massive memory waste and fragmentation caused by pre-allocating large, fixed-size contiguous blocks for variable-length sequences, a common bottleneck in high-throughput LLM serving.

By enabling dynamic allocation and sharing of these memory pages across different sequences in a batch, PagedAttention drastically increases GPU memory utilization. This allows serving systems to support significantly higher concurrent request loads without out-of-memory errors, directly improving throughput while maintaining low latency. The algorithm is foundational to vLLM's performance and is a key technique in modern inference optimization stacks.

PAGEDATTENTION

Key Features and Benefits

PagedAttention is a memory management algorithm for the Key-Value (KV) cache in transformer-based language models. It applies virtual memory paging concepts to eliminate fragmentation and waste, enabling high-throughput, low-latency serving of variable-length sequences.

Eliminates KV Cache Fragmentation

Traditional KV cache allocation reserves contiguous memory blocks for each request's maximum possible sequence length, leading to significant internal fragmentation as most sequences are shorter than the maximum. PagedAttention divides the KV cache into fixed-size blocks (analogous to memory pages). These blocks are allocated non-contiguously only as needed for the actual sequence, virtually eliminating wasted memory. This allows more concurrent requests to fit into GPU VRAM, directly increasing throughput.
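
The difference in reserved memory can be checked with simple arithmetic. The sketch below counts reserved token slots under both schemes; the 2048-token reservation and the four sequence lengths are made-up values for illustration, and the block size of 16 is only an example.

```python
import math

BLOCK_SIZE = 16      # tokens per block (example value)
MAX_SEQ_LEN = 2048   # hypothetical per-request reservation under static allocation

def static_slots(actual_lens, max_len=MAX_SEQ_LEN):
    """Token slots reserved when every request pre-allocates max_len."""
    return len(actual_lens) * max_len

def paged_slots(actual_lens, block_size=BLOCK_SIZE):
    """Token slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in actual_lens)

lens = [37, 512, 130, 1900]   # hypothetical actual sequence lengths
used = sum(lens)

print("static waste:", static_slots(lens) - used)   # slots reserved but never used
print("paged waste: ", paged_slots(lens) - used)    # only the tail of each last block
```

With these numbers, static allocation reserves 8192 slots for 2579 tokens, while block-based allocation reserves 2608, so the only waste is the unfilled tail of each sequence's last block.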

Enables Efficient Continuous Batching

PagedAttention's block-based management is the foundation for high-efficiency continuous batching (also known as in-flight batching). Because sequences are composed of independent blocks, the scheduler can:

Dynamically add new requests to a running batch.
Free the blocks of completed sequences immediately so that waiting requests can reuse them.
Swap the blocks of preempted sequences out to CPU RAM when GPU memory runs short.
Swap those blocks back into GPU memory when a preempted sequence resumes generation.

This keeps the GPU busy even with highly variable request arrival times and sequence lengths.

Supports Advanced Memory Sharing

The algorithm enables shared memory blocks across different sequences or within a single sequence. This is critical for two optimizations:

Shared Prompt Prefixes: In multi-user scenarios where requests share a common system prompt or context, PagedAttention allows a single physical block of KV cache for the shared prefix to be referenced by all relevant sequences, providing dramatic memory savings.
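Parallel Sampling: When generating multiple output candidates from the same input (e.g., several sampled completions per prompt, or beam search), the shared context's KV cache is not duplicated. The candidates reference the same physical blocks, and a block is copied only when one candidate needs to write to it (copy-on-write). This reduces memory pressure during advanced decoding strategies.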
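
Block sharing can be sketched with reference counts and copy-on-write. The `BlockPool` class and its method names below are hypothetical and are not vLLM's API; the sketch tracks block ids only and omits the actual copying of KV data.

```python
class BlockPool:
    """Toy pool of physical KV blocks with reference counts (illustrative only)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table):
        """Create a new sequence that shares every block of an existing one."""
        for block in block_table:
            self.refcount[block] += 1
        return list(block_table)

    def writable(self, block_table, logical_index):
        """Return a block that is safe to write to, copying first if it is shared."""
        block = block_table[logical_index]
        if self.refcount[block] == 1:
            return block                 # sole owner: write in place
        self.refcount[block] -= 1        # drop this sequence's reference
        new_block = self.allocate()      # a real system would also copy the KV data
        block_table[logical_index] = new_block
        return new_block

pool = BlockPool(8)
parent = [pool.allocate(), pool.allocate()]   # two blocks holding a shared prompt
child = pool.fork(parent)                     # second candidate, no new memory yet
pool.writable(child, 1)                       # child diverges in its last block
print(parent, child)
```

After the write, both sequences still share the first block, and only the block that diverged has been duplicated.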

Reduces Out-of-Memory Errors & Enables Longer Contexts

By packing KV cache blocks densely and sharing memory, PagedAttention substantially reduces out-of-memory (OOM) errors during high-concurrency serving. It also makes long context windows (e.g., 128K or 1M tokens) more practical to serve, because memory is committed only for the tokens a request actually uses rather than reserved up front for the full window. A sequence that does fill a long context still needs KV cache proportional to its length, so PagedAttention removes the waste, not the underlying cost. This helps when deploying models that process large documents, lengthy conversations, or complex codebases.

Architectural Foundation for vLLM

PagedAttention is the core innovation behind the vLLM inference serving engine. The vLLM team's initial benchmarks reported throughput up to 24x higher than HuggingFace Transformers and up to 3.5x higher than Text Generation Inference (TGI), gains they attribute largely to this memory management scheme. The algorithm turns the KV cache from a statically reserved, wasteful resource into a dynamically managed pool, which lets vLLM serve as a high-performance backend for LLM APIs.

Decouples Logical & Physical Layout

PagedAttention introduces a level of indirection between the logical sequence of tokens and their physical storage in memory. A block table (similar to a page table) maps the logical positions of tokens to their actual physical block addresses. This decoupling is what enables all other features:

Non-contiguous allocation, which prevents fragmentation.
Easy sharing of physical blocks across logical sequences.
Efficient swapping of blocks between GPU and CPU memory.

This design mirrors operating system principles, applying proven systems techniques to the specific problem of attention cache management.
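
The indirection can be illustrated with a lookup that translates a logical token position into a physical location. This is a minimal sketch: the function name `locate`, the block size of 4, and the block ids are assumptions chosen for readability.

```python
BLOCK_SIZE = 4  # tokens per block; kept small so the example is easy to follow

def locate(block_table, token_index, block_size=BLOCK_SIZE):
    """Translate a logical token position into (physical_block, offset)."""
    logical_block, offset = divmod(token_index, block_size)
    return block_table[logical_block], offset

# Logical blocks 0, 1, 2 of one sequence live in scattered physical blocks.
block_table = [7, 2, 9]

print(locate(block_table, 0))    # first token of the sequence
print(locate(block_table, 5))    # second slot of the second logical block
print(locate(block_table, 11))   # last slot of the third logical block
```

The sequence sees a contiguous range of positions 0 to 11, while the physical blocks can sit anywhere in the pool.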

KV CACHE OPTIMIZATION

PagedAttention vs. Traditional KV Cache Management

A technical comparison of memory management strategies for the Key-Value (KV) cache during autoregressive LLM inference, focusing on efficiency, fragmentation, and throughput.

Feature / Metric	Traditional KV Cache (Static Allocation)	PagedAttention (Dynamic Paging)
Core Management Paradigm	Static, contiguous pre-allocation per sequence	Dynamic, non-contiguous paging using virtual memory concepts
Memory Allocation Unit	Entire sequence length (max_tokens)	Fixed-size blocks (e.g., 16 tokens)
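Memory Fragmentation	High (internal & external)	Near-zero (no external fragmentation; internal waste limited to each sequence's last, partially filled block)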
Memory Waste for Variable-Length Sequences	Significant (unused pre-allocated slots)	Minimal (allocates only used blocks)
Support for Advanced Sampling (e.g., beam search)	Inefficient, requires separate cache copies	Efficient, shares physical blocks via virtual mapping
Effect on System Throughput (QPS)	Lower due to memory bottleneck	Higher (vLLM reports up to 24x over HuggingFace Transformers)
Implementation Complexity	Lower (simple tensor allocation)	Higher (requires block table & allocator)
Optimal Use Case	Uniform, predictable sequence lengths	Highly variable, unpredictable sequence lengths (e.g., chat, long document processing)

PAGEDATTENTION

Frequently Asked Questions

PagedAttention is a foundational algorithm for optimizing large language model inference. These questions address its core mechanisms, benefits, and role in modern serving systems.

PagedAttention is an algorithm that manages the Key-Value (KV) cache in transformer-based language models using concepts borrowed from virtual memory paging in operating systems. During autoregressive decoding, the model keeps the key and value vectors of every previous token in GPU memory so that each new token can attend to them without recomputation; this stored state is the KV cache. PagedAttention divides this cache into fixed-size blocks (pages). Instead of reserving contiguous memory for each request's variable-length sequence, which causes memory fragmentation, it allocates non-contiguous blocks on demand. A block table, maintained per sequence by a central block manager, maps the logical order of tokens to physical memory blocks. This allows blocks to be shared between sequences (e.g., for a shared prompt) and variable-length sequences to be stored compactly. Waste is confined to the last, partially filled block of each sequence, and the memory recovered can serve additional concurrent requests, increasing throughput.

Key Mechanism: When a request needs to store KV vectors for a new token, the memory allocator assigns it to the next available slot in the current block. If the block is full, a new block is allocated. The block table tracks this mapping, enabling non-contiguous storage while preserving the logical order for the attention computation.
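
The append path described above can be sketched in a few lines. The `SequenceCache` class, its fields, and the free-list representation are hypothetical; a real engine also writes the key and value tensors into the returned slot.

```python
BLOCK_SIZE = 4  # tokens per block; small for readability

class SequenceCache:
    """Tracks which physical blocks hold one sequence's KV entries (toy model)."""

    def __init__(self, free_blocks, block_size=BLOCK_SIZE):
        self.free_blocks = free_blocks   # shared list of free physical block ids
        self.block_size = block_size
        self.block_table = []            # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Reserve a slot for one new token and return (physical_block, offset)."""
        offset = self.num_tokens % self.block_size
        if offset == 0:
            # First token, or the current block is full: take a new block.
            self.block_table.append(self.free_blocks.pop(0))
        self.num_tokens += 1
        return self.block_table[-1], offset

free = [5, 1, 8]                 # free physical blocks, deliberately out of order
seq = SequenceCache(free)
slots = [seq.append_token() for _ in range(6)]
print(slots)
print(seq.block_table, free)
```

Six tokens fill block 5 and then spill into block 1; block 8 stays in the free list for other sequences.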

INFERENCE OPTIMIZATION

Related Terms

PagedAttention is a core component of modern high-performance LLM serving. These related concepts define the broader ecosystem of techniques and metrics for optimizing inference latency and throughput.

Continuous Batching

Also known as in-flight batching, this is an inference optimization technique in which new requests are added to a running batch on the GPU as soon as previous requests finish generation. This maximizes GPU utilization and throughput by eliminating idle time, in contrast with static batching, where the entire batch must finish before a new one starts. It is a foundational technique used in conjunction with PagedAttention in engines like vLLM.
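
The effect on GPU idle time can be illustrated by counting decode iterations under each policy. This is a toy model, not a real scheduler: each request is reduced to the number of tokens it still has to generate, every iteration produces one token per running request, and the function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, max_batch):
    """Iterations needed when each batch must fully finish before the next starts."""
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

def continuous_batching_steps(lengths, max_batch):
    """Iterations needed when finished requests are replaced immediately."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free batch slots before each iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; drop those that are done.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps

lengths = [2, 8, 3, 3]   # hypothetical output lengths, in tokens
print(static_batching_steps(lengths, max_batch=2))
print(continuous_batching_steps(lengths, max_batch=2))
```

In this example the static policy needs 11 iterations because the 2-token request leaves its slot idle while the 8-token request finishes, whereas the continuous policy needs 8.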

Key-Value (KV) Cache

A memory structure used during the autoregressive decoding of Transformer models. It stores the computed keys and values for all previous tokens, both prompt and generated, allowing the model to attend to them without recomputing these tensors for every new token. Because its size grows with sequence length and the final length is not known in advance, the KV cache is the primary source of the memory fragmentation and waste that PagedAttention was designed to solve.
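
The size of the cache follows from the model's shape. The sketch below assumes a standard decoder with one key and one value vector per layer, KV head, and token; the configuration values are illustrative and do not describe any specific model.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative configuration: 32 layers, 32 KV heads, head dimension 128, 16-bit values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token / 1024, "KiB per token")
print(per_token * 2048 / 2**30, "GiB for a 2048-token sequence")
```

At this shape each token costs 512 KiB, so a single 2048-token sequence holds 1 GiB of cache, which is why reserving the maximum length for every request is so expensive.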

vLLM

A high-throughput, memory-efficient open-source inference and serving engine for large language models. It is the original implementation and primary production system for the PagedAttention algorithm. vLLM's architecture is designed to maximize GPU utilization and support high concurrent request loads, making it a benchmark for LLM serving performance. Its design directly addresses the bottlenecks of KV cache management.

Memory Fragmentation

A condition where free memory is broken into small, non-contiguous blocks, preventing the allocation of larger contiguous blocks even if total free memory is sufficient. In LLM inference, this occurs because variable-length sequences cause the KV cache for each request to be allocated and freed at different times. PagedAttention eliminates this by using a virtual memory paging approach, allowing non-contiguous physical memory blocks to serve a logically contiguous cache.
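
A small example makes the condition concrete. The memory layout and helper names below are hypothetical: each list entry is one slot, and paging is modeled as being able to use any free slot regardless of position.

```python
def largest_free_run(occupied):
    """Length of the longest run of contiguous free slots."""
    best = run = 0
    for used in occupied:
        run = 0 if used else run + 1
        best = max(best, run)
    return best

def can_allocate_contiguous(occupied, size):
    """A contiguous allocator needs one free run at least `size` long."""
    return largest_free_run(occupied) >= size

def can_allocate_paged(occupied, size):
    """A paged allocator only needs `size` free slots anywhere."""
    return occupied.count(False) >= size

# True = slot in use, False = free. Six slots are free, but never more than two in a row.
memory = [True, False, False, True, False, False, True, False, False]

print(can_allocate_contiguous(memory, 4))
print(can_allocate_paged(memory, 4))
```

A request for four slots fails under contiguous allocation even though six are free, and succeeds once the slots no longer have to be adjacent.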

Speculative Decoding

An inference acceleration technique that reduces the number of slow, sequential autoregressive steps from a large target model. A small, fast draft model (or a simpler heuristic) proposes a short sequence of candidate tokens. The large target model then verifies this sequence in a single, parallel forward pass, accepting correct tokens and rejecting incorrect ones. This technique improves Time Per Output Token (TPOT) and is often used alongside memory optimizations like PagedAttention.
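
The verification step can be sketched for the simplest case, greedy decoding, where a drafted token is accepted only if it equals the target model's own choice. This is an assumption made for brevity; sampling-based variants use a probabilistic acceptance rule instead. The function name and token ids are hypothetical, and the model calls themselves are left out.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a drafted token sequence.

    target_tokens[i] is the target model's greedy choice at position i, given
    the prefix plus draft_tokens[:i]. It has len(draft_tokens) + 1 entries,
    all produced by one parallel forward pass.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        accepted.append(drafted)
    # The target's own token at the first mismatch (or after a fully accepted
    # draft) is always valid, so every verification yields at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted

print(verify_draft([5, 9, 3], [5, 9, 4, 1]))   # third drafted token is rejected
print(verify_draft([5, 9, 3], [5, 9, 3, 1]))   # whole draft accepted, plus one more
```

Each verification pass therefore produces between one and `len(draft_tokens) + 1` tokens for the cost of a single target-model step.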

Operator Fusion

A compiler-level inference optimization that combines multiple sequential neural network operations into a single GPU kernel. For example, a pattern like Linear -> Bias Add -> Activation can be fused. This reduces GPU kernel launch overhead and intermediate memory reads/writes, lowering latency. While PagedAttention optimizes memory management, operator fusion (used in engines like TensorRT) optimizes the computation graph itself.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
