Inferensys

Cache Eviction

Cache eviction is the algorithmic process of selectively removing entries from a Key-Value (KV) Cache or other context cache to manage memory usage, governed by a defined policy such as Least Recently Used (LRU).
CONTEXT WINDOW MANAGEMENT

What is Cache Eviction?

Cache eviction is a critical memory management process in computing, especially for AI systems managing limited context windows.

Cache eviction is the automated process of removing entries from a cache according to a predefined policy to free space for new data. In AI systems, this is essential for managing the KV Cache during long-sequence generation and for maintaining context within an agent's working memory. Common policies include Least Recently Used (LRU), which discards the entry that has gone longest without being accessed, and First-In-First-Out (FIFO), which removes the entry that was cached earliest. Eviction is triggered when the cache reaches capacity, preventing memory overflow and keeping resource use bounded.

For agentic workflows and context window management, eviction policies determine which historical tokens or conversational turns are purged first when the model's token limit is reached. Effective eviction is a core component of frameworks like StreamingLLM, which supports very long, effectively unbounded interactions by retaining attention sinks (the first few tokens of the sequence) alongside a sliding window of the most recent context. This balances the need for recent, task-relevant information against the finite inference memory available, directly impacting response coherence and operational latency.
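
As a rough illustration of purging conversational turns under a token budget, the sketch below keeps the system prompt pinned and drops the oldest turns first. The names trim_history and count_tokens are hypothetical, and the whitespace word count is only a stand-in for a real tokenizer.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


def trim_history(system_prompt: str, turns: list[str], max_tokens: int) -> list[str]:
    """Keep the system prompt and evict the oldest turns until the budget fits."""
    kept = deque(turns)
    used = count_tokens(system_prompt) + sum(count_tokens(t) for t in kept)
    while kept and used > max_tokens:
        evicted = kept.popleft()          # oldest turn goes first
        used -= count_tokens(evicted)
    return [system_prompt, *kept]


history = trim_history(
    "you are helpful",
    ["a b c", "d e", "f g h i"],
    max_tokens=10,
)
print(history)
```

Pinning the system prompt is a design choice of this sketch, not a requirement of any particular framework; a production agent might instead summarize evicted turns rather than discard them.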

Common Cache Eviction Policies

When a cache (like a KV Cache) reaches capacity, an eviction policy determines which entries are removed to make space for new data. The choice of policy directly impacts the performance and relevance of retained context for agentic workflows.

01

Least Recently Used (LRU)

Least Recently Used (LRU) evicts the data that has not been accessed for the longest time. It operates on the principle that recently used information is more likely to be needed again.

  • Mechanism: Typically implemented with a doubly linked list and a hash map. When data is accessed, it's moved to the front (most recent). The tail of the list holds the least recently used item for eviction.
  • Use Case: Highly effective for conversational agents where the most recent user query and the model's immediate prior response are critical for coherence.
  • Drawback: Can be inefficient if older context becomes relevant again later in a long session, leading to cache thrashing.
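
A minimal LRU sketch in Python, assuming the standard library's collections.OrderedDict is an acceptable substitute for a hand-written linked list plus hash map. The class name LRUCache is illustrative. Note that this sketch keeps the most recent entry at the end of the ordering rather than the front, so the eviction candidate sits at the start.

```python
from collections import OrderedDict


class LRUCache:
    """Illustrative LRU cache; most recently used keys live at the end."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used key

    def keys(self):
        return list(self._data)


cache = LRUCache(capacity=2)
cache.put("turn-1", "hello")
cache.put("turn-2", "how are you?")
cache.get("turn-1")                # touching turn-1 makes turn-2 the oldest
cache.put("turn-3", "fine")        # evicts turn-2
print(cache.keys())
```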
02

First-In-First-Out (FIFO)

First-In-First-Out (FIFO) evicts the oldest entry in the cache, regardless of how often or recently it has been used. It's a simple queue-based policy.

  • Mechanism: New entries are added to the back of a queue. When eviction is required, the entry at the front (oldest) is removed.
  • Use Case: Useful for streaming data or logging scenarios where historical order is the primary concern, or when access patterns are completely uniform.
  • Drawback: Often performs poorly in practice because it may evict frequently accessed, valuable context simply because it was loaded early.
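
The queue mechanism can be sketched as follows, assuming a deque for arrival order and a dict for lookups. FIFOCache is a hypothetical name. Reads never change the eviction order, which is exactly the weakness described in the drawback above.

```python
from collections import deque


class FIFOCache:
    """Illustrative FIFO cache: eviction order depends only on insertion order."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._order = deque()   # keys in arrival order, oldest on the left
        self._data = {}

    def get(self, key, default=None):
        # Reading an entry does not protect it from eviction.
        return self._data.get(key, default)

    def put(self, key, value):
        if key not in self._data:
            if len(self._data) >= self.capacity:
                oldest = self._order.popleft()
                del self._data[oldest]
            self._order.append(key)
        self._data[key] = value

    def keys(self):
        return list(self._order)


cache = FIFOCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
for _ in range(100):
    cache.get("a")          # heavy use of "a" makes no difference
cache.put("c", 3)           # "a" is still evicted because it arrived first
print(cache.keys())
```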
03

Least Frequently Used (LFU)

Least Frequently Used (LFU) evicts the item that has been used the least number of times. It prioritizes retention based on historical popularity.

  • Mechanism: Maintains a counter for each cache entry that increments on every access. The entry with the smallest count is evicted.
  • Use Case: Can be effective for agents with repetitive tasks where certain foundational context (e.g., system prompts, core instructions) is referenced constantly.
  • Drawback: Prone to cache pollution. An item accessed many times in the distant past may persist indefinitely, blocking newer, potentially more relevant context. Requires more overhead to track frequencies.
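
A deliberately simple LFU sketch, assuming a linear scan for the smallest counter is acceptable (production LFU caches typically use frequency buckets to avoid the scan). LFUCache is a hypothetical name; ties are broken in favor of evicting the earliest-inserted key.

```python
class LFUCache:
    """Illustrative LFU cache with an O(n) search for the eviction victim."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = {}
        self._hits = {}   # per-key access counter

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._hits[key] += 1
        return self._data[key]

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.capacity:
            # min() returns the first key with the smallest counter.
            victim = min(self._hits, key=self._hits.get)
            del self._data[victim]
            del self._hits[victim]
        self._data[key] = value
        self._hits[key] = self._hits.get(key, 0) + 1

    def keys(self):
        return list(self._data)


cache = LFUCache(capacity=2)
cache.put("system_prompt", "You are a careful assistant.")
cache.put("scratch_note", "temporary")
cache.get("system_prompt")
cache.get("system_prompt")
cache.put("new_fact", "value")     # evicts scratch_note, the least used entry
print(cache.keys())
```

Because the counters in this sketch never decay, it also reproduces the pollution problem noted above: an entry that was popular long ago keeps its high count indefinitely.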
04

Random Replacement (RR)

Random Replacement (RR) selects a candidate entry for eviction at random. It is a stateless, low-overhead policy.

  • Mechanism: On eviction, a pseudo-random number generator selects an index in the cache to remove.
  • Use Case: Provides a simple baseline. Its performance can be surprisingly competitive when access patterns are unpredictable or when the overhead of tracking recency/frequency is prohibitive.
  • Drawback: Unpredictable and can evict critical, hot context by chance, leading to performance volatility. Not suitable for deterministic or high-performance agentic systems.
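
A small sketch of random replacement, assuming Python's random.Random as the pseudo-random number generator and a dense key list so that a victim index can be drawn in constant time. RandomCache and its seed parameter are illustrative; the seed exists only to make runs repeatable.

```python
import random


class RandomCache:
    """Illustrative random-replacement cache with no recency or frequency data."""

    def __init__(self, capacity: int, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._rng = random.Random(seed)
        self._keys = []   # dense list of keys, used only to pick a random index
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        if key not in self._data:
            if len(self._keys) >= self.capacity:
                i = self._rng.randrange(len(self._keys))
                # Swap the victim to the end so removal is O(1).
                self._keys[i], self._keys[-1] = self._keys[-1], self._keys[i]
                victim = self._keys.pop()
                del self._data[victim]
            self._keys.append(key)
        self._data[key] = value

    def __len__(self):
        return len(self._data)


cache = RandomCache(capacity=3, seed=0)
for n in range(10):
    cache.put(f"k{n}", n)
print(len(cache), cache.get("k9"))
```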
05

Time-To-Live (TTL) Expiration

Time-To-Live (TTL) Expiration is not a capacity-driven policy but a time-based one. Each cache entry has a timestamp and is evicted once a predefined lifespan expires.

  • Mechanism: A background process, or a check performed on access, looks for entries where creation_time + TTL < current_time and removes them.
  • Use Case: Essential for ensuring context freshness. Critical in multi-agent systems where cached state (e.g., sensor data, market prices) becomes stale and misleading after a certain period.
  • Drawback: Does not directly address capacity limits. Often used in conjunction with another policy (e.g., LRU-TTL) where entries are evicted due to expiry or capacity pressure.
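
The expiry check can be sketched as below, assuming lazy expiration on access plus an optional sweep. TTLCache, purge_expired, and the injectable clock parameter are hypothetical names; the clock defaults to time.monotonic and is replaceable so the behavior can be demonstrated without sleeping.

```python
import time


class TTLCache:
    """Illustrative TTL cache: entries expire a fixed time after creation."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._data = {}   # key -> (value, creation_time)

    def put(self, key, value):
        self._data[key] = (value, self._clock())

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, created = item
        if created + self.ttl < self._clock():   # expired: evict lazily on access
            del self._data[key]
            return default
        return value

    def purge_expired(self):
        """Sweep that a background task could call periodically."""
        now = self._clock()
        expired = [k for k, (_, created) in self._data.items()
                   if created + self.ttl < now]
        for k in expired:
            del self._data[k]
        return expired


now = [0.0]                                  # fake clock for the demonstration
cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
cache.put("market_price", 101.5)
now[0] = 5.0
print(cache.get("market_price"))             # still fresh
now[0] = 11.0
print(cache.get("market_price"))             # expired, returns the default
```

This sketch has no capacity limit at all, which mirrors the drawback above: a TTL alone bounds staleness, not memory.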
06

Adaptive & Hybrid Policies

Modern systems often use adaptive or hybrid policies that combine strategies or adjust behavior based on runtime workload characteristics.

  • Examples:
    • LRU-K: Considers the time of the K-th-to-last access rather than only the last one, better resisting one-hit wonders that pollute a standard LRU cache.
    • 2Q (Two Queues): Uses two queues: a FIFO queue for items accessed once and a main LRU queue for items accessed multiple times, improving hit rates.
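    • ARC (Adaptive Replacement Cache): Dynamically balances recency against frequency by maintaining two LRU lists, one for entries seen once recently and one for entries seen at least twice, and shifting the target size of each when a miss matches a "ghost" record of a recently evicted key.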
  • Use Case: Complex, long-running autonomous agents with shifting context needs, where no single static policy is optimal.
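
To make the hybrid idea concrete, here is a sketch in the spirit of a simplified 2Q cache. It is an assumption-laden illustration rather than a faithful reproduction of the published algorithm: new keys enter a FIFO probation queue, a second access promotes them to an LRU main queue, and the two queue sizes are fixed. TwoQueueCache and its parameter names are hypothetical.

```python
from collections import OrderedDict


class TwoQueueCache:
    """Illustrative two-queue cache: FIFO probation plus LRU main queue."""

    def __init__(self, probation_size: int, main_size: int):
        self.probation_size = probation_size
        self.main_size = main_size
        self._probation = OrderedDict()   # keys seen once, in arrival order
        self._main = OrderedDict()        # keys seen again, in recency order

    def get(self, key, default=None):
        if key in self._main:
            self._main.move_to_end(key)
            return self._main[key]
        if key in self._probation:
            value = self._probation.pop(key)   # second access: promote
            self._main[key] = value
            if len(self._main) > self.main_size:
                self._main.popitem(last=False)  # evict LRU entry of main queue
            return value
        return default

    def put(self, key, value):
        if key in self._main:
            self._main[key] = value
            self._main.move_to_end(key)
        elif key in self._probation:
            self._probation[key] = value        # update; FIFO position unchanged
        else:
            self._probation[key] = value
            if len(self._probation) > self.probation_size:
                self._probation.popitem(last=False)  # evict oldest newcomer


cache = TwoQueueCache(probation_size=2, main_size=2)
cache.put("system_prompt", "S")
cache.get("system_prompt")            # promoted to the main queue
for key in ("doc-1", "doc-2", "doc-3"):
    cache.put(key, key.upper())       # a one-off scan only churns probation
print(cache.get("system_prompt"), cache.get("doc-1"))
```

The point of the split is that a burst of entries touched only once cannot push out context that has already proven useful.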

How Cache Eviction Works in Transformer Inference

Cache eviction is a critical memory management process in transformer-based language models, determining which cached data is removed to make room for new computations when capacity limits are reached.

Cache eviction is the algorithmic process of selectively removing entries from a transformer's KV Cache or other intermediate state caches to manage finite memory resources during inference. It is triggered when the cache reaches its allocated capacity, such as when a sequence length exceeds the model's context window. The eviction policy—like Least Recently Used (LRU) or First-In-First-Out (FIFO)—defines the rule for which cached key-value vectors are discarded first, directly impacting generation speed and output coherence.

Efficient eviction is essential for long-context or streaming applications like StreamingLLM. Poor policies can force expensive recomputation or discard semantically critical tokens, degrading performance. Strategies often combine eviction with attention sink preservation and sliding window techniques to maintain stable, continuous generation. This low-level memory management is a core component of inference optimization, balancing latency, throughput, and hardware constraints.
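
The combination of attention sink preservation and a sliding window can be sketched as an index-selection rule over cached positions. This is a simplified illustration: positions_to_keep, evict, num_sinks, and window are hypothetical names, the cache is modeled as a flat list, and a real implementation would operate on per-layer key and value tensors and would also have to handle positional encodings, which are not shown here.

```python
def positions_to_keep(cache_len: int, num_sinks: int, window: int) -> list[int]:
    """Indices retained under a 'first few tokens plus most recent tokens' rule."""
    if cache_len <= num_sinks + window:
        return list(range(cache_len))            # under budget: nothing to evict
    sinks = list(range(num_sinks))               # earliest positions (attention sinks)
    recent = list(range(cache_len - window, cache_len))  # sliding window
    return sinks + recent


def evict(kv_entries: list, num_sinks: int, window: int) -> list:
    """Apply the rule to a flat list standing in for cached key-value entries."""
    keep = positions_to_keep(len(kv_entries), num_sinks, window)
    return [kv_entries[i] for i in keep]


tokens = list("abcdefghij")                      # ten cached positions
print(positions_to_keep(len(tokens), num_sinks=2, window=3))
print(evict(tokens, num_sinks=2, window=3))
```

Everything between the sinks and the window is discarded, so the retained cache size stays fixed at num_sinks + window no matter how long generation runs.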

CACHE EVICTION

Frequently Asked Questions

Cache eviction is a critical memory management process in AI systems, determining which data is removed from a cache when capacity is reached. This FAQ addresses its core mechanisms, policies, and role in optimizing agentic workflows.

Cache eviction is the automated process of removing entries from a cache—such as a KV Cache or conversational history buffer—to free up memory when the cache reaches its capacity limit. It works by continuously monitoring cache usage and applying a predefined eviction policy (e.g., LRU, FIFO) to select which items to remove. When a new item needs to be inserted into a full cache, the eviction algorithm identifies the least valuable entry according to its policy, deletes it, and then writes the new data in its place. This mechanism is fundamental to managing the context window in language models, ensuring efficient memory use for long interactions or document processing.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.