Inferensys

KV Cache (Key-Value Cache)

KV Cache is a transformer optimization that stores computed key and value tensors for previous tokens during autoregressive generation, eliminating redundant computation and dramatically speeding up sequential token generation.
CONTEXT WINDOW MANAGEMENT

What is KV Cache (Key-Value Cache)?

A core optimization technique for transformer-based language models that dramatically accelerates sequential text generation.

KV Cache (Key-Value Cache) is a transformer inference optimization that stores the computed key and value tensors for all previous tokens in a sequence during autoregressive generation. By caching these intermediate attention states, the model avoids recalculating them for each new token. This reduces the attention cost of each generation step from O(N²) to O(N) in the current sequence length N, which results in much lower latency and compute cost per token after the first.

The cache grows with each processed token, so its memory footprint scales with the number of tokens held in the model's context window. Managing this cache is critical: when the sequence length exceeds the context limit or the available memory, an eviction policy (e.g., FIFO, LRU) must remove older key-value pairs. Techniques like sliding window attention and frameworks such as StreamingLLM build on efficient KV Cache management to support very long, streaming generation without catastrophic performance degradation.

INFERENCE OPTIMIZATION

Key Characteristics of KV Cache

The KV Cache is a critical performance optimization for transformer-based language models during autoregressive generation. Its primary function is to eliminate redundant computation by storing intermediate states.

01

Core Mechanism: Caching Attention States

During token-by-token autoregressive generation, a transformer without a cache would recompute the keys, values, and self-attention for all previous tokens at every step. The KV Cache stores the computed Key (K) and Value (V) tensors for all previous positions after the initial forward pass. For each new generation step, the model computes the K and V vectors only for the new token and concatenates them with the cached tensors from previous steps. This reduces the attention cost of each step from O(N²) to O(N) in the current sequence length N, providing large speedups for long generations.
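The mechanism can be sketched for a single attention head in plain Python. This is a minimal illustration, not a production implementation: the helper names (`matvec`, `attend`, `decode_without_cache`, `decode_with_cache`), the random weights, and the use of Python lists instead of GPU tensors are all assumptions made to keep the example dependency-free.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_without_cache(xs, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix at every step."""
    outputs = []
    for t in range(len(xs)):
        keys = [matvec(Wk, x) for x in xs[: t + 1]]
        values = [matvec(Wv, x) for x in xs[: t + 1]]
        outputs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outputs

def decode_with_cache(xs, Wq, Wk, Wv):
    """Compute K and V once per token and append them to the cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in xs:
        k_cache.append(matvec(Wk, x))  # only the new token's K
        v_cache.append(matvec(Wv, x))  # only the new token's V
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs

random.seed(0)
d = 4  # hypothetical head dimension
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
              for _ in range(3))
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

without_cache = decode_without_cache(xs, Wq, Wk, Wv)
with_cache = decode_with_cache(xs, Wq, Wk, Wv)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(without_cache, with_cache)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

Both functions produce the same attention outputs; the cached version simply skips the repeated K and V projections for tokens it has already seen.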

02

Memory vs. Compute Trade-off

The KV Cache introduces a fundamental engineering trade-off:

  • Compute Savings: Eliminates the quadratic recomputation of attention over the growing sequence history.
  • Memory Cost: The cache size grows linearly with the sequence length, the number of layers, and the total key-value width (n_kv_heads * d_head). For a model with n_layers layers, n_kv_heads key-value heads, and a per-head dimension d_head, the memory footprint for one sequence of length L is approximately 2 * n_layers * n_kv_heads * d_head * L * dtype_size bytes, where the factor 2 accounts for storing both keys and values and dtype_size is the number of bytes per element. For large models and long contexts, this can consume multiple gigabytes of GPU memory, becoming a primary bottleneck for batch size and maximum context length.
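As a worked example of the formula above, the following sketch computes the footprint for a hypothetical configuration. The specific values (32 layers, 8 key-value heads, head dimension 128, 8192 tokens, 2-byte elements) are illustrative assumptions, not the specification of any particular model, and the `batch_size` multiplier is an added assumption that every sequence in a batch holds its own cache of the same length.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, dtype_size, batch_size=1):
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    batch_size is an assumption: each sequence keeps its own cache.
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * dtype_size * batch_size

# Hypothetical model configuration, chosen only for illustration.
config = dict(n_layers=32, n_kv_heads=8, d_head=128)

one_sequence = kv_cache_bytes(**config, seq_len=8192, dtype_size=2)
batch_of_16 = kv_cache_bytes(**config, seq_len=8192, dtype_size=2, batch_size=16)

print(one_sequence / 2**30)  # 1.0 (GiB)
print(batch_of_16 / 2**30)   # 16.0 (GiB)
```

Under these assumptions a single 8192-token sequence needs about 1 GiB, which shows how quickly the cache limits the achievable batch size.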
03

Architectural Dependence (Decoder-Only Models)

The KV Cache is most essential for decoder-only transformer architectures (e.g., GPT, Llama, Mistral) used for autoregressive text generation. Its utility is inherent to the causal attention mask, which prevents tokens from attending to future tokens. This mask creates the redundancy that the cache exploits. In contrast:

  • Encoder-only models (e.g., BERT) use bidirectional attention and process the full sequence in one parallel pass, making a KV Cache unnecessary.
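  • Encoder-decoder models (e.g., T5) can cache the keys and values of the decoder's self-attention in the same way. They also perform cross-attention to the encoder's output; those keys and values are fixed once the encoder has run, so they can be computed once and reused rather than grown token by token.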
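The role of the causal mask can be checked with a small experiment: under causal attention, appending a token leaves the outputs at earlier positions unchanged, which is what makes their cached keys and values reusable. The sketch below is a simplified illustration that assumes identity projections (query, key, and value all equal the input vector); the function name `attention` and the toy inputs are hypothetical.

```python
import math

def attention(xs, causal):
    """Single-head self-attention with identity projections (q = k = v = x)."""
    outputs = []
    for i, q in enumerate(xs):
        # A causal mask lets position i see only positions 0..i.
        visible = xs[: i + 1] if causal else xs
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, visible)) / total
                        for j in range(len(q))])
    return outputs

prefix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extended = prefix + [[2.0, -1.0]]  # append one new token

# Compare outputs at the first three positions before and after the append.
causal_unchanged = attention(prefix, True) == attention(extended, True)[:3]
bidirectional_unchanged = attention(prefix, False) == attention(extended, False)[:3]

print(causal_unchanged)         # True
print(bidirectional_unchanged)  # False
```

With bidirectional attention every earlier position also attends to the new token, so its output changes and nothing computed for the shorter sequence could be reused as-is.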
04

Integration with Continuous Batching

In production inference servers, the KV Cache is managed at the batch level. Continuous batching (or iterative batching) is a technique where incoming requests of different sequence lengths are batched together dynamically. Each request in the batch has its own independent KV Cache. The inference engine must:

  • Allocate and manage heterogeneous cache sizes per request.
  • Handle padding efficiently within the batch's combined KV tensors.
  • Implement cache eviction for completed sequences to free memory.

This complex memory management is a core feature of high-performance inference engines like vLLM, TGI (Text Generation Inference), and NVIDIA TensorRT-LLM.
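The scheduling loop behind continuous batching can be sketched as a toy simulation. It is not how vLLM, TGI, or TensorRT-LLM are implemented: the `Request` class, the `run` function, the `token_budget` parameter, and the conservative policy of reserving each request's worst-case cache size at admission are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    prompt_len: int
    max_new_tokens: int
    cache_len: int = 0   # tokens currently held in this request's KV cache
    generated: int = 0

    @property
    def worst_case(self):
        return self.prompt_len + self.max_new_tokens

def run(requests, token_budget):
    """Toy continuous-batching loop; returns per-step {name: cache_len} snapshots."""
    waiting, running, timeline = deque(requests), [], []
    while waiting or running:
        # Admit waiting requests while their worst-case cache fits the budget.
        reserved = sum(r.worst_case for r in running)
        while waiting and reserved + waiting[0].worst_case <= token_budget:
            req = waiting.popleft()
            req.cache_len = req.prompt_len  # prefill fills the cache with the prompt
            reserved += req.worst_case
            running.append(req)
        if not running:
            raise ValueError("a request does not fit in the token budget")
        timeline.append({r.name: r.cache_len for r in running})
        # One decode step: every running request appends one token to its cache.
        for r in running:
            r.generated += 1
            r.cache_len += 1
        # Completed requests leave the batch, which frees their cache.
        running = [r for r in running if r.generated < r.max_new_tokens]
    return timeline

timeline = run(
    [Request("A", 8, 4), Request("B", 6, 2), Request("C", 10, 3)],
    token_budget=25,
)
for step in timeline:
    print(step)
```

In this run, request C waits until B finishes and releases its reservation, then joins the batch while A is still decoding, without waiting for the whole batch to drain.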
05

Eviction and Memory Management

When the context window is full, or to manage memory across many concurrent requests, cache entries must be evicted. Common policies include:

  • Least Recently Used (LRU): Discards the cache entries that were accessed least recently. In serving systems this is often applied to whole blocks or sequences rather than to individual tokens.
  • First-In-First-Out (FIFO): Evicts the oldest tokens in the sequence.
  • Sliding Window: Maintains a cache only for the most recent W tokens, providing a constant memory footprint. StreamingLLM identified the need to preserve a few initial tokens as "attention sinks" to maintain generation stability when using a sliding window.

Advanced systems may also employ paged attention, which stores the cache in non-contiguous memory blocks (pages) to reduce fragmentation and waste.
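The sliding-window policy with attention sinks can be sketched as a function over cached token positions. This is a simplified illustration: a real cache holds K and V tensors per layer rather than integers, and the function name `evict` and the parameters `n_sink` and `window` are hypothetical. Setting `n_sink` to 0 gives a plain sliding window, which behaves like FIFO eviction.

```python
def evict(positions, n_sink, window):
    """Keep the first n_sink positions plus the most recent `window` positions.

    Assumes window > 0. Returns a new list; the input is not modified.
    """
    if len(positions) <= n_sink + window:
        return list(positions)
    return list(positions[:n_sink]) + list(positions[-window:])

cache = []
for pos in range(12):
    cache.append(pos)                         # the new token's K/V entry arrives
    cache = evict(cache, n_sink=2, window=4)  # cache never exceeds 6 entries

print(cache)  # [0, 1, 8, 9, 10, 11]

plain_window = evict(list(range(12)), n_sink=0, window=4)
print(plain_window)  # [8, 9, 10, 11]
```

The cache size stays bounded at `n_sink + window` entries regardless of how many tokens are generated.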
06

Quantization and Compression

To reduce the memory footprint of the KV Cache, several quantization techniques are employed:

  • FP8 or INT8 Quantization: Storing the cache in lower precision (8-bit floating point or integer) instead of FP16 or BF16. This can halve memory usage but may require careful calibration to avoid generation quality degradation.
  • Selective Quantization: Applying aggressive quantization to older, less frequently accessed parts of the cache while keeping recent tokens in higher precision.
  • Dynamic Quantization: Adjusting precision per layer or per head based on sensitivity analysis.

Research into KV Cache compression is active, exploring methods like pruning low-magnitude values or using low-rank approximations to represent the cached states.
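A minimal sketch of INT8 cache quantization is shown below, assuming symmetric absmax scaling with one scale per cached vector. That granularity is an assumption chosen for simplicity (real systems may use per-channel or per-group scales), and the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(vec):
    """Symmetric absmax quantization of one cached K or V vector.

    Returns the integer codes (in [-127, 127]) and the scale needed to restore them.
    """
    largest = max(abs(x) for x in vec)
    scale = largest / 127.0 if largest > 0 else 1.0  # avoid dividing by zero
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale

def dequantize_int8(codes, scale):
    """Restore approximate floating-point values from the integer codes."""
    return [c * scale for c in codes]

key_vector = [0.5, -1.27, 0.03, 1.0]  # toy cached key vector
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print(codes)  # [50, -127, 3, 100]
print(max_error <= scale / 2 + 1e-12)  # True
```

Each element is stored in one byte instead of two, at the cost of a rounding error of at most half a quantization step per element plus the small overhead of storing the scale.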
ENGINEERING FAQ

Frequently Asked Questions About KV Cache

A technical deep dive into the Key-Value Cache, the core optimization that enables efficient autoregressive generation in transformer models by eliminating redundant computation.

The KV Cache (Key-Value Cache) is an inference optimization for transformer decoder models that trades memory for compute: it stores the computed key (K) and value (V) tensors for all previously processed tokens, both prompt and generated, during autoregressive text generation. During the first forward pass for a prompt, the model computes the K and V matrices for every token in the input sequence. For each subsequent generation step, instead of recomputing K and V and re-running attention for all previous tokens, which would cost O(n²) per step, the model retrieves these tensors from the cache, computes K and V only for the new token, and performs attention using the concatenated cached and new tensors. This reduces the attention cost of each generation step from O(n²) to O(n), where n is the current sequence length, leading to large latency reductions.
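The per-step saving can be made concrete by counting operations. The sketch below is a simplified cost model, not a benchmark: it assumes that a step without a cache re-runs causal attention over the entire prefix, and it counts only K/V projections and attention-score dot products for a single head and layer. The function names are hypothetical.

```python
def step_cost(n, use_cache):
    """Return (kv_projections, attention_scores) for one step at sequence length n."""
    if use_cache:
        # Project only the new token; it attends over n cached positions.
        return 1, n
    # Re-project all n tokens and recompute the full causal attention triangle.
    return n, n * (n + 1) // 2

def total_attention_scores(n_tokens, use_cache):
    """Sum attention-score computations over steps at lengths 1..n_tokens."""
    return sum(step_cost(n, use_cache)[1] for n in range(1, n_tokens + 1))

print(step_cost(1000, use_cache=False))  # (1000, 500500)
print(step_cost(1000, use_cache=True))   # (1, 1000)
print(total_attention_scores(100, use_cache=False))  # 171700
print(total_attention_scores(100, use_cache=True))   # 5050
```

Under this model the work per step grows quadratically without a cache and linearly with one, which matches the O(n²) to O(n) reduction described above.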

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.