The Key-Value (KV) Cache is a memory buffer used during the autoregressive inference of transformer models (like GPT or Llama) that stores the computed key and value tensors for all previously processed tokens in a sequence. By caching these intermediate activations, the model avoids recalculating them for every new token, reducing the cost of each generation step from O(N²) to O(N) in the current sequence length N and making text generation dramatically faster. This optimization is fundamental to the practical deployment of large language models.
Key-Value (KV) Cache

What is Key-Value (KV) Cache?
A core optimization for transformer-based autoregressive generation, the KV Cache is a memory buffer that stores intermediate computations to eliminate redundant work.
The cache is managed per transformer layer and is specific to the attention mechanism. For each new token, the model computes the query, key, and value for the current position only, appends the new key and value to the cache, and performs attention against the cached keys and values from all positions so far. Efficient memory management of the KV Cache, such as through techniques like PagedAttention in vLLM, is critical for serving throughput, as the cache size grows linearly with both batch size and sequence length, often becoming the dominant memory consumer during inference.
Key Characteristics of the KV Cache
The Key-Value (KV) Cache is a critical optimization for transformer inference. The sections below detail its core mechanisms, benefits, and operational considerations in production serving environments.
Core Mechanism: Autoregressive Computation Reuse
During autoregressive generation, a transformer model produces its output one token at a time. Without a cache, each new token would require recomputing the key (K) and value (V) tensors for every previous token before attention could be evaluated. The KV Cache stores the computed K and V tensors for all previous positions in the sequence. When generating the next token, the model retrieves these cached tensors instead of recomputing them, avoiding redundant matrix multiplications for the historical context. This reduces the cost of each generation step from O(N²) to O(N) in the current sequence length N, which is the primary source of its speedup.
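The reuse described above can be illustrated with a minimal sketch of a single attention head in a single layer, written in plain Python. The projection matrices `W_Q`, `W_K`, `W_V` and the token vectors are made-up illustrative values, not taken from any real model. The sketch checks that decoding with a cache produces the same attention outputs as recomputing K and V for the whole prefix at every step.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention for one query vector.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

# Hypothetical 2x2 projection matrices (illustrative values only).
W_Q = [[0.5, -0.1], [0.2, 0.7]]
W_K = [[0.3, 0.4], [-0.6, 0.1]]
W_V = [[1.0, 0.2], [0.0, -0.5]]

def decode_without_cache(tokens):
    # Recomputes K and V for the entire prefix at every step.
    outputs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [matvec(W_K, x) for x in prefix]
        values = [matvec(W_V, x) for x in prefix]
        outputs.append(attend(matvec(W_Q, prefix[-1]), keys, values))
    return outputs

def decode_with_cache(tokens):
    # Computes K and V once per token and appends them to the cache.
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(matvec(W_K, x))
        v_cache.append(matvec(W_V, x))
        outputs.append(attend(matvec(W_Q, x), k_cache, v_cache))
    return outputs

tokens = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.3, -0.7]]
same = all(
    math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
    for out_a, out_b in zip(decode_without_cache(tokens), decode_with_cache(tokens))
    for a, b in zip(out_a, out_b)
)
print(same)
```

In a real model the same idea applies independently to every layer and every head, with tensors in place of Python lists.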
Primary Benefit: Dramatic Latency Reduction
The KV Cache's main value is drastically reducing per-token latency during text generation. By eliminating the need to reprocess the entire growing context for each new token, inference becomes significantly faster. For example, generating a 100-token response with a 32-layer model involves 100 × 32 = 3,200 layer-level attention passes, and with caching each of them processes only the newest token instead of the entire prefix. This directly translates to higher throughput (tokens/second) and lower response latency for end-users, which is a critical metric for production LLM APIs.
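To make the saving concrete, the following sketch counts K/V projection work in units of "one token projected in one layer". It assumes an empty prompt and ignores everything except the K/V projections, so it is an illustration of the scaling rather than a latency model. The function name `kv_projections` is hypothetical.

```python
def kv_projections(num_tokens, num_layers, use_cache):
    """Count K/V projections, where one unit is one token projected in one layer."""
    if use_cache:
        # Each token is projected exactly once per layer.
        per_layer = num_tokens
    else:
        # Step t re-projects all t tokens seen so far: 1 + 2 + ... + N.
        per_layer = num_tokens * (num_tokens + 1) // 2
    return per_layer * num_layers

without_cache = kv_projections(100, 32, use_cache=False)
with_cache = kv_projections(100, 32, use_cache=True)
print(without_cache, with_cache)  # prints: 161600 3200
```

For a 100-token response on a 32-layer model, the uncached count is roughly 50 times the cached one, and the gap widens as the sequence grows.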
Memory Footprint & Management Challenge
The KV Cache introduces a significant, dynamic memory overhead. The cache size is proportional to:
- Batch Size (B): Number of parallel requests.
- Sequence Length (L): Total context length (prompt + generation).
- Number of Layers (N_l): Transformer blocks in the model.
- Hidden Dimension (D) & Attention Heads (H).
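For a large model (e.g., Llama 3 70B with 80 layers, 64 heads, 128 head-dim), caching a 4096-token sequence for a batch of 8 in FP16 would require about 86 GB (80 GiB) of VRAM just for the cache if each of the 64 heads stored its own keys and values. Llama 3 70B uses grouped-query attention with 8 key-value heads, which reduces this to about 10.7 GB (10 GiB). Footprints of this size call for advanced memory management techniques such as PagedAttention (used in vLLM), typically combined with continuous batching, to limit memory fragmentation and avoid out-of-memory errors.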
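The proportionality above can be turned into a small calculator. This is a sketch under the assumption that the cache stores one key and one value vector per token, per layer, per key-value head. The function name `kv_cache_bytes` and its parameter names are illustrative.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_element=2):
    """Estimate KV cache size in bytes; the factor 2 covers keys plus values."""
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_element)

# 80 layers, head dimension 128, 4096 tokens, batch of 8, FP16 (2 bytes).
all_heads = kv_cache_bytes(8, 4096, 80, n_kv_heads=64, head_dim=128)
grouped = kv_cache_bytes(8, 4096, 80, n_kv_heads=8, head_dim=128)

print(f"{all_heads / 2**30:.1f} GiB, {grouped / 2**30:.1f} GiB")
# prints: 80.0 GiB, 10.0 GiB
```

The eightfold difference between the two results shows why the number of key-value heads, not the number of query heads, is the figure that matters for cache sizing.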
Integration with Continuous Batching
In production servers, the KV Cache is managed in tandem with continuous batching (also called iterative batching). This optimization allows new requests to be added to a running batch as previous requests finish generation. Each request in the batch has its own, logically separate KV Cache. The serving engine must:
- Allocate and deallocate cache memory dynamically per request.
- Maintain correct attention masking so requests within a batch don't attend to each other's caches.
- Handle variable sequence lengths efficiently across the batch.
This combination is what enables high GPU utilization and throughput in servers like vLLM and TGI, as the GPU is kept constantly busy processing tokens from multiple requests at different generation stages.
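The scheduling benefit can be seen in a toy simulation that counts decode steps for a set of requests with different output lengths. It is a simplified sketch: it assumes every request needs only decode steps (no prefill cost), that each step produces one token per active request, and that a freed slot can be reused on the very next step. Both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    # Each fixed batch runs until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # A finished request frees its slot (and its cache) for a waiting request.
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step yields one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [2, 8, 3, 8, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)  # prints: 18 13
```

With static batching, short requests leave their slot idle until the longest request in the batch completes. Continuous batching fills that slot immediately, which is where the saved steps come from.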
Impact on PEFT & Multi-Adapter Serving
When serving models fine-tuned with Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA or Adapters, the KV Cache depends on which adapter is active. For a merged model (where LoRA weights are fused into the base model), the KV Cache operates normally. In multi-adapter serving scenarios, where a single base model dynamically switches between different adapter sets, more care is needed:
- Cached K and V tensors can be reused only under the adapter that produced them. Even when an adapter leaves the key and value projections untouched, it changes the hidden states that feed those projections in all subsequent layers, so the cached values differ from adapter to adapter.
- Each request therefore keeps a cache tied to its own adapter, and switching adapters in the middle of a sequence requires recomputing that sequence's cache.
- Multi-tenant or multi-task serving from a single GPU instance remains efficient because the large base weights are shared across requests, while only the small adapter weights and the per-request caches differ.
Operational Trade-offs & Eviction Policies
Managing the KV Cache involves key operational decisions:
- Cache Eviction: For very long conversations or document processing, the cache may exceed available memory. Systems implement eviction policies (e.g., FIFO, which discards the oldest tokens) or a sliding context window. Both trade quality or speed for memory: evicted tokens are either no longer visible to the model or must be recomputed if they are needed again.
- Precision: Caching in 16-bit formats (FP16 or BF16) versus 8-bit formats (FP8 or INT8) trades memory savings against potential precision loss that can affect output quality.
- Warm-up: A model warm-up phase runs a dummy forward pass before serving traffic so that memory allocators and kernels are primed, preventing high latency for the first real request (cold start).
- Observability: Monitoring cache hit rates, memory usage, and per-request cache allocation is crucial for performance debugging and capacity planning in production.
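As an illustration of the FIFO eviction policy mentioned above, here is a minimal sliding-window cache built on `collections.deque`. The class name `SlidingWindowKVCache` is hypothetical, and strings stand in for real key and value tensors. It also tracks an eviction counter of the kind an observability dashboard might expose.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps K/V entries for only the most recent `window` tokens (FIFO eviction)."""

    def __init__(self, window):
        # A deque with maxlen drops its oldest item when a new one is appended.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)
        self.evicted = 0

    def append(self, key, value):
        if len(self.keys) == self.keys.maxlen:
            self.evicted += 1  # the oldest entry is about to be dropped
        self.keys.append(key)
        self.values.append(value)

cache = SlidingWindowKVCache(window=4)
for position in range(6):
    cache.append(f"k{position}", f"v{position}")

print(list(cache.keys), cache.evicted)
# prints: ['k2', 'k3', 'k4', 'k5'] 2
```

After six tokens with a window of four, the entries for positions 0 and 1 are gone, so the model can no longer attend to them unless they are recomputed.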
KV Cache Memory Footprint: Impact of Model Parameters
This table quantifies how key architectural parameters of a transformer model directly impact the memory required to store the KV Cache during autoregressive inference. The per-token and peak figures are computed as 2 × layers × heads × head dimension × 2 bytes for a single sequence, assuming FP16 before any quantization and standard multi-head attention in which every head caches its own keys and values. Models that use grouped-query attention cache fewer heads and need proportionally less.
| Model Parameter | Small Model (7B Params) | Medium Model (70B Params) | Large Model (400B Params) |
|---|---|---|---|
| KV Cache per Token | ~0.5 MiB | ~2.5 MiB | ~8 MiB |
| Sequence Length (Max Context) | 4096 | 8192 | 32768 |
| Peak Cache for Full Context | ~2 GiB | ~20 GiB | ~256 GiB |
| Layers (Transformer Blocks) | 32 | 80 | 128 |
| Attention Heads | 32 | 64 | 128 |
| Head Dimension | 128 | 128 | 128 |
| Precision (for Cache) | FP16 | FP16 | FP16 (w/ quantization) |
| Typical GPU Memory Class | 24 GB (e.g., A10G) | 80 GB (e.g., A100/H100) | Multi-GPU/Node Required |
Serving Optimizations & Advanced Cache Techniques
The Key-Value (KV) Cache is a critical memory optimization for transformer inference that stores computed attention key and value tensors to avoid redundant computation during autoregressive text generation.
Core Mechanism & Purpose
During autoregressive decoding (generating one token at a time), a transformer without a cache would recompute the key (K) and value (V) tensors for all previous tokens at every step. The KV Cache stores these computed K and V tensors in a memory buffer. At each subsequent generation step, the model retrieves the cached tensors for previous tokens and computes K and V only for the new token, so the cost of a step grows linearly rather than quadratically with the sequence length. This is the primary driver behind the significant speed-up in text generation for models like GPT, Llama, and Claude.
Memory Footprint & Challenges
The KV Cache's memory consumption is substantial and grows linearly with:
- Batch Size: Number of sequences processed in parallel.
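- Sequence Length: Number of tokens processed so far (prompt plus generated tokens).
- Model Size: Number of layers, number of key-value heads, and head dimension.
The formula is roughly: Memory ≈ 2 * batch_size * seq_len * n_layers * n_heads * d_head * bytes_per_element, where the factor 2 accounts for storing both keys and values and n_heads counts the heads whose keys and values are cached.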
For a large model serving many long conversations, this can consume tens of gigabytes, often exceeding the memory required for the model weights themselves. This creates the central serving challenge: managing this dynamic, growing memory allocation efficiently across many concurrent requests.
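One practical use of this estimate is capacity planning: given a memory budget reserved for the cache, how many full-length sequences can be served at once? The sketch below assumes every sequence reserves space for its full context and ignores allocator overhead. The function name and the 40 GiB budget are illustrative assumptions.

```python
def max_concurrent_sequences(cache_budget_bytes, seq_len, n_layers,
                             n_kv_heads, head_dim, bytes_per_element=2):
    """How many sequences of length seq_len fit in the cache budget."""
    per_sequence = (2 * seq_len * n_layers
                    * n_kv_heads * head_dim * bytes_per_element)
    return cache_budget_bytes // per_sequence

budget = 40 * 2**30  # hypothetical 40 GiB left over after loading weights

# 32 layers, 32 key-value heads, head dimension 128, FP16.
at_4k = max_concurrent_sequences(budget, 4096, 32, 32, 128)
at_8k = max_concurrent_sequences(budget, 8192, 32, 32, 128)
print(at_4k, at_8k)  # prints: 20 10
```

Doubling the context length halves the number of concurrent sequences, which is the linear growth described above seen from the capacity side.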
PagedAttention (vLLM)
PagedAttention is a breakthrough optimization algorithm introduced by the vLLM inference engine. It solves the problem of memory fragmentation caused by variable-length sequences. It draws an analogy from operating system memory management:
- The KV Cache is divided into fixed-size blocks.
- The cache for each sequence is represented as a list of these blocks, which don't need to be contiguous in physical memory.
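- This allows for non-contiguous storage, which removes external fragmentation and limits wasted space to the partially filled last block of each sequence.
- It enables efficient memory sharing when several outputs are generated from the same prompt (e.g., parallel sampling or beam search), since sequences can point to the same physical blocks.
This technique is why vLLM achieves near-optimal memory utilization (over 90%) and high throughput.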
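The block-table idea in the list above can be sketched with a toy allocator. This is not vLLM's implementation. It is a simplified illustration in which the class name `BlockAllocator` and its methods are hypothetical, and it tracks only block ids, not the tensors stored in them.

```python
class BlockAllocator:
    """Toy paged allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # Allocate a new block only when the last one is full (or none exists).
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # Return all of a finished sequence's blocks to the free pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=4, block_size=2)
for seq_id in ["a", "b", "a", "a"]:
    alloc.append_token(seq_id)

tables_before_free = {k: list(v) for k, v in alloc.block_tables.items()}
alloc.free("a")
print(tables_before_free, sorted(alloc.free_blocks))
# prints: {'a': [3, 1], 'b': [2]} [0, 1, 3]
```

Sequence "a" ends up in blocks 3 and 1, which are not adjacent, yet its logical order is preserved by the block table. When "a" finishes, both blocks return to the pool for reuse.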
Continuous Batching & Cache Management
Continuous Batching (or iterative batching) is an inference scheduling paradigm that optimizes how the KV Cache is utilized across requests. Unlike static batching, it continuously adds new requests to a running batch as other requests finish generation.
- The scheduler must manage the lifecycle of the KV Cache for each request (allocation, growth, and eviction).
- Finished sequences free their cache blocks, making them available for new sequences.
- This leads to significantly higher GPU utilization and throughput, as the GPU is kept constantly busy, and cache memory is recycled efficiently.
Systems like TGI (Text Generation Inference) and vLLM implement this.
Quantization & Compression
To reduce the memory pressure of the KV Cache, several quantization and compression techniques are applied:
- KV Cache Quantization: Storing the K and V tensors in a lower precision format (e.g., FP8, INT8) instead of FP16/BF16. This can halve the memory footprint but may require calibration to minimize accuracy loss.
- Selective Caching: Not caching keys/values from certain layers or attention heads based on profiling that shows they are less critical for performance.
- Compressed Caching: Applying lossy compression algorithms to older tokens in very long contexts, as their influence on the next token often diminishes.
These techniques are essential for serving long-context models (e.g., 128K+ tokens) cost-effectively.
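As an illustration of KV Cache quantization, the sketch below applies symmetric INT8 quantization to a single key vector with one scale per vector. This is one simple scheme chosen for clarity. Production systems may use different granularities or formats, and the function names and sample values are hypothetical.

```python
def quantize_int8(vector):
    """Symmetric per-vector quantization: returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vector], scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]

key = [0.12, -1.5, 0.6, 0.0]  # stand-in for one cached key vector
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(key, restored))
within_bound = max_error <= scale / 2 + 1e-12
print(codes, within_bound)
```

Each element now needs one byte instead of two, plus a single scale per vector. The reconstruction error is bounded by half a quantization step, which is the precision cost traded for the memory saving.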
Integration with PEFT Serving
Serving models fine-tuned with LoRA or Adapters introduces complexity for the KV Cache. The cache is computed from the active model weights. When using multi-adapter serving:
- The KV Cache is adapter-specific. The cached keys and values are a function of the base weights plus the active adapter's weights.
- Simply switching adapters on a shared base model makes the existing cache invalid, as the internal representations change.
- Solutions involve one of the following:
  - Cache recomputation on adapter switch (costly).
  - Adapter-aware caching, where the cache is keyed by a (sequence_id, adapter_id) pair, logically partitioning the memory.
  - Using merged weights for inference, which creates a standalone model with a single, consistent KV Cache.
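Adapter-aware caching can be sketched as a dictionary keyed by the `(sequence_id, adapter_id)` pair. The class below is a hypothetical illustration of that partitioning, with strings standing in for key and value tensors. It does not reflect the API of any particular serving framework.

```python
class AdapterAwareKVCache:
    """Toy cache partitioned by (sequence_id, adapter_id)."""

    def __init__(self):
        self._entries = {}  # (sequence_id, adapter_id) -> list of (key, value)

    def append(self, sequence_id, adapter_id, key, value):
        self._entries.setdefault((sequence_id, adapter_id), []).append((key, value))

    def get(self, sequence_id, adapter_id):
        # Entries written under a different adapter are never returned.
        return self._entries.get((sequence_id, adapter_id), [])

    def drop_sequence(self, sequence_id):
        # Release every partition belonging to a finished sequence.
        for cache_key in [k for k in self._entries if k[0] == sequence_id]:
            del self._entries[cache_key]

cache = AdapterAwareKVCache()
cache.append("seq-1", "adapter-A", "k0", "v0")
cache.append("seq-1", "adapter-A", "k1", "v1")

hit = cache.get("seq-1", "adapter-A")
miss = cache.get("seq-1", "adapter-B")
print(len(hit), len(miss))  # prints: 2 0
```

A lookup under a different adapter returns nothing, so stale keys and values computed with another adapter's weights can never be mixed into attention by accident.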
Frequently Asked Questions
Key-Value (KV) Cache is a critical optimization for transformer inference. These questions address its core mechanics, implementation, and role in production serving systems.
The Key-Value (KV) Cache is a memory buffer used during the autoregressive generation of transformer models to store computed key and value tensors for previously processed tokens, avoiding redundant computation. During inference, a transformer's self-attention mechanism calculates a key (K) and value (V) vector for each token in the input sequence. When a new sequence starts, the K and V for every prompt token are computed from scratch and stored (the prefill phase). For each subsequent token, the model computes the K and V only for the new token, while retrieving the K and V for all previous tokens from the cache. This eliminates the need to reprocess the entire growing sequence for every new token, reducing the cost of each generation step from O(N^2) to O(N) in the current sequence length N, which results in massive latency and throughput improvements.
Related Terms
The KV Cache is a core component of high-performance transformer inference. These related terms define the ecosystem of techniques and systems that manage computation, memory, and latency during model serving.
Continuous Batching
Also known as iterative batching, this is an advanced inference optimization for autoregressive models like transformers. Unlike static batching, it allows new requests to be added to a running batch as previous requests finish generating tokens. This technique maximizes GPU utilization and throughput by keeping the computational hardware constantly occupied, and it works in tandem with the KV Cache to manage the state of multiple concurrent sequences efficiently.
Attention Mechanism
The core neural network component that the KV Cache optimizes. In the transformer architecture, self-attention computes a weighted sum of values (V) based on the compatibility between a query (Q) and a set of keys (K). For a generated token at position i, the attention scores are computed against all previous tokens 1...i. The KV Cache stores the pre-computed K and V tensors for these previous tokens, eliminating the need to recompute them for every new token, which is the source of its significant speedup.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.