Context caching is a computational optimization that stores intermediate states generated during a language model's forward pass, most notably the key-value (KV) cache produced by transformer attention layers, so that subsequent inference calls do not repeat the same work. When the pre-computed key and value tensors for tokens that remain static across requests (such as a system prompt or a long document prefix) are cached, the model only needs to compute attention for the new tokens, dramatically reducing latency and computational cost. This technique is fundamental to efficient multi-turn conversations and streaming generation in agentic workflows.
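The mechanics can be illustrated with a minimal sketch. The snippet below is a toy model, not a real inference engine: the projection matrices `W_K` and `W_V`, the `KVCache` class, and the `prefix_id` lookup key are all hypothetical stand-ins. It shows the core idea that a static prefix's key/value tensors are computed once and reused, so repeated requests only project their new suffix tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding / head dimension

# Hypothetical fixed projection matrices standing in for a trained model's
# W_K and W_V; in a real transformer these belong to each attention layer.
W_K = rng.standard_normal((D, D))
W_V = rng.standard_normal((D, D))

def project_kv(token_embeddings):
    """Compute key and value tensors for a block of token embeddings."""
    return token_embeddings @ W_K, token_embeddings @ W_V

class KVCache:
    """Caches K/V tensors for a static prefix so that requests sharing
    the prefix only pay the projection cost for their new tokens."""
    def __init__(self):
        self._store = {}

    def keys_values(self, prefix_id, prefix_emb, suffix_emb):
        if prefix_id not in self._store:        # cache miss: full prefix cost, once
            self._store[prefix_id] = project_kv(prefix_emb)
        k_pre, v_pre = self._store[prefix_id]   # cache hit: prefix work skipped
        k_new, v_new = project_kv(suffix_emb)   # only new tokens are projected
        return np.vstack([k_pre, k_new]), np.vstack([v_pre, v_new])

# A shared "system prompt" prefix followed by two different user turns.
prefix = rng.standard_normal((50, D))  # 50 static prefix tokens
turn_a = rng.standard_normal((5, D))
turn_b = rng.standard_normal((3, D))

cache = KVCache()
k1, v1 = cache.keys_values("sys-prompt", prefix, turn_a)  # computes prefix K/V
k2, v2 = cache.keys_values("sys-prompt", prefix, turn_b)  # reuses cached prefix K/V

print(k1.shape, k2.shape)  # (55, 8) (53, 8)
```

In a production system the cached tensors live on the accelerator and the lookup is typically by a hash of the prefix tokens rather than a string key, but the cost structure is the same: the prefix is processed once, and each subsequent request pays only for its suffix.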
