KV Cache (Key-Value Cache) is a transformer inference optimization that stores the key and value tensors computed for all previous tokens in a sequence during autoregressive generation. Because each new token attends to the entire prefix, a naive implementation would recompute keys and values for every prior token at every step. Caching these intermediate attention states reduces the per-token attention cost from O(N²) to O(N) in the current sequence length N (the cached tensors are computed once and reused), which drastically lowers latency and compute cost for every token after the first, at the price of extra memory that grows linearly with context length.
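The mechanism can be illustrated with a minimal single-head sketch in NumPy. This is a hypothetical toy implementation, not any particular framework's API: `KVCache` simply accumulates the key/value rows as tokens arrive, and `attend_with_cache` computes attention for one new token by projecting only that token and scoring it against the cached keys. A full causal-attention function is included as the uncached reference the incremental path must match.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Accumulates key/value rows, one token at a time (toy sketch)."""
    def __init__(self, d_model):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend_with_cache(x_new, Wq, Wk, Wv, cache):
    """Attention for ONE new token: project only x_new, reuse cached K/V.
    Per-step cost is linear in the current sequence length."""
    q = x_new @ Wq                          # query for the new token only
    cache.append(x_new @ Wk, x_new @ Wv)    # K/V computed once, then stored
    scores = q @ cache.keys.T / np.sqrt(Wq.shape[1])
    return softmax(scores) @ cache.values   # shape (1, d_model)

def attend_full(X, Wq, Wk, Wv):
    """Uncached causal attention over the whole prefix, for comparison.
    Recomputes Q, K, V for all tokens at every call."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    t = X.shape[0]
    scores = Q @ K.T / np.sqrt(Wq.shape[1])
    scores[np.triu(np.ones((t, t), dtype=bool), k=1)] = -np.inf  # causal mask
    return softmax(scores) @ V
```

Feeding tokens one at a time through `attend_with_cache` reproduces the last row of `attend_full` at each step, while only ever projecting the newest token, which is exactly the redundancy the cache eliminates.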
