Key-Value (KV) caching is an inference optimization technique for autoregressive transformer models that stores the computed key (K) and value (V) vectors for all previous tokens in a sequence to avoid redundant computation during subsequent generation steps.
During the self-attention mechanism, each token in the input sequence is projected into a query (Q), key (K), and value (V) vector. For a new token being generated, its query vector must be compared (via dot product) with the key vectors of all preceding tokens to calculate attention scores. Without caching, the K and V vectors for all previous tokens would need to be recomputed from scratch at every generation step, so each step costs O(n²) in the sequence length rather than the O(n) that the new token's attention actually requires.
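The uncached behavior described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the dimension `d`, the random projection matrices `Wq`, `Wk`, `Wv`, and the function name `causal_attention` are all illustrative assumptions. The key point is that every decoding step calls the function on the whole prefix, re-projecting K and V for tokens that were already processed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x):
    """Full causal self-attention over the n tokens in x, recomputing K and V each call."""
    n = len(x)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv         # all n tokens projected again
    scores = Q @ K.T / np.sqrt(d)            # n x n dot products: O(n^2)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                   # token i attends only to tokens 0..i
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax
    return w @ V

# Without a cache, step t re-runs attention on the entire prefix of length t+1,
# recomputing K and V for tokens 0..t-1 that were already projected before.
xs = rng.standard_normal((5, d))
step_outputs = [causal_attention(xs[: t + 1])[-1] for t in range(5)]
```

Note how the causal mask makes earlier rows independent of later tokens, which is exactly what makes the recomputation redundant and a cache possible.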
KV caching eliminates this redundancy. After processing token t, its computed Kₜ and Vₜ vectors are stored in a KV cache. When generating token t+1, the model computes Q, K, and V only for the new token; the K and V vectors for tokens 0 to t are retrieved from the cache rather than recomputed. This reduces the per-step cost of attention during generation from O(n²) to O(n), yielding dramatic latency reductions, especially for the long sequences common in agentic workflows.
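The cached variant can be sketched as follows, again as a minimal single-head NumPy illustration with assumed names (`attend_cached`, `k_cache`, `v_cache`) and random projection matrices. Each call projects only the one new token, appends its Kₜ and Vₜ to the cache, and attends over the cached keys and values; a full recompute is included at the end to check that the incremental result matches.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []                    # the KV cache: one entry per past token

def attend_cached(x_new):
    """One decoding step: project only the new token, reuse cached K and V."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)               # K_t and V_t are computed once, then stored
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)              # O(n) dot products per step, not O(n^2)
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # softmax over the n cached positions
    return w @ V

# Feed a short sequence token by token; the cache grows by one entry per step.
xs = rng.standard_normal((5, d))
cached_out = [attend_cached(x) for x in xs]

# Sanity check: the cached result for the last token matches a full recompute.
K_full, V_full = xs @ Wk, xs @ Wv
s = K_full @ (xs[-1] @ Wq) / np.sqrt(d)
w = np.exp(s - s.max()); w /= w.sum()
full_out = w @ V_full
```

In a real transformer the cache is kept per layer and per attention head (typically as a pair of tensors of shape [layers, heads, seq_len, head_dim]), which is why KV-cache memory, not compute, often becomes the binding constraint at long context lengths.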