Cache eviction is the automated process of removing entries from a cache according to a predefined policy to free space for new data. In AI systems, this is essential for managing the KV Cache during long-sequence generation and for maintaining context within an agent's working memory. Common policies include Least Recently Used (LRU), which discards the entry that has gone longest without being accessed, and First-In-First-Out (FIFO), which removes the entry that was inserted earliest. Eviction is triggered when the cache reaches capacity, which prevents memory overflow and keeps memory use bounded.
Cache Eviction

What is Cache Eviction?
Cache eviction is a critical memory management process in computing, especially for AI systems managing limited context windows.
For agentic workflows and context window management, eviction policies determine which historical tokens or conversational turns are purged first when the model's token limit is saturated. Effective eviction is a core component of frameworks like StreamingLLM, enabling infinite-length interactions by strategically maintaining attention sinks and a sliding window of relevant context. This balances the need for recent, task-relevant information against the finite inference memory available, directly impacting response coherence and operational latency.
Common Cache Eviction Policies
When a cache (like a KV Cache) reaches capacity, an eviction policy determines which entries are removed to make space for new data. The choice of policy directly impacts the performance and relevance of retained context for agentic workflows.
Least Recently Used (LRU)
Least Recently Used (LRU) evicts the data that has not been accessed for the longest time. It operates on the principle that recently used information is more likely to be needed again.
- Mechanism: Typically implemented with a doubly linked list and a hash map. When data is accessed, it's moved to the front (most recent). The tail of the list holds the least recently used item for eviction.
- Use Case: Highly effective for conversational agents where the most recent user query and the model's immediate prior response are critical for coherence.
- Drawback: Can be inefficient if older context becomes relevant again later in a long session, leading to cache thrashing.
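
The mechanism above can be sketched with Python's `collections.OrderedDict`, which combines the hash map and the ordered list in one structure. This is a minimal illustration, not a production cache: the class name `LRUCache` and its methods are hypothetical, and the ordering is kept oldest-first, so the eviction candidate sits at the start of the dictionary rather than at the tail.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key (illustrative sketch)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries = OrderedDict()  # least recently used first, most recent last

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # a read counts as a use
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        return list(self._entries)
```

With a capacity of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` was touched more recently.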
First-In-First-Out (FIFO)
First-In-First-Out (FIFO) evicts the oldest entry in the cache, regardless of how often or recently it has been used. It's a simple queue-based policy.
- Mechanism: New entries are added to the back of a queue. When eviction is required, the entry at the front (oldest) is removed.
- Use Case: Useful for streaming data or logging scenarios where historical order is the primary concern, or when access patterns are completely uniform.
- Drawback: Often performs poorly in practice because it may evict frequently accessed, valuable context simply because it was loaded early.
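
A queue-based FIFO cache might look like the sketch below. The class name `FIFOCache` is hypothetical, and one assumption is made explicit: overwriting an existing key keeps its original position in the queue. Reads never change the eviction order, which is exactly the drawback described above.

```python
from collections import deque


class FIFOCache:
    """Fixed-capacity cache that evicts the earliest inserted key (illustrative sketch)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._order = deque()  # insertion order, oldest at the left
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)  # reads do not affect eviction order

    def put(self, key, value):
        if key in self._data:
            self._data[key] = value  # assumption: an update keeps the original queue position
            return
        if len(self._data) >= self.capacity:
            oldest = self._order.popleft()
            del self._data[oldest]
        self._order.append(key)
        self._data[key] = value

    def keys(self):
        return list(self._order)
```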
Least Frequently Used (LFU)
Least Frequently Used (LFU) evicts the item that has been used the least number of times. It prioritizes retention based on historical popularity.
- Mechanism: Maintains a counter for each cache entry that increments on every access. The entry with the smallest count is evicted.
- Use Case: Can be effective for agents with repetitive tasks where certain foundational context (e.g., system prompts, core instructions) is referenced constantly.
- Drawback: Prone to cache pollution. An item accessed many times in the distant past may persist indefinitely, blocking newer, potentially more relevant context. Requires more overhead to track frequencies.
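
The counter-based mechanism can be sketched as follows. The class name `LFUCache` is hypothetical, and this version deliberately stays simple: it scans all counters on eviction (linear time) and never decays old counts, so it also reproduces the cache pollution drawback noted above.

```python
class LFUCache:
    """Fixed-capacity cache that evicts the key with the fewest accesses (illustrative sketch)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._values = {}
        self._counts = {}  # key -> number of accesses (reads and writes)

    def get(self, key, default=None):
        if key not in self._values:
            return default
        self._counts[key] += 1
        return self._values[key]

    def put(self, key, value):
        if key not in self._values and len(self._values) >= self.capacity:
            # Linear scan for the smallest counter; ties go to the earliest inserted key.
            victim = min(self._counts, key=self._counts.get)
            del self._values[victim]
            del self._counts[victim]
        self._values[key] = value
        self._counts[key] = self._counts.get(key, 0) + 1

    def keys(self):
        return list(self._values)
```

Production LFU implementations usually avoid the linear scan with frequency buckets or a heap, and often age the counters so that stale but once-popular entries can eventually be evicted.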
Random Replacement (RR)
Random Replacement (RR) selects a candidate entry for eviction at random. It is a stateless, low-overhead policy.
- Mechanism: On eviction, a pseudo-random number generator selects an index in the cache to remove.
- Use Case: Provides a simple baseline. Its performance can be surprisingly competitive when access patterns are unpredictable or when the overhead of tracking recency/frequency is prohibitive.
- Drawback: Unpredictable and can evict critical, hot context by chance, leading to performance volatility. Not suitable for deterministic or high-performance agentic systems.
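
A random replacement cache needs no per-entry bookkeeping, as this sketch shows. The class name `RandomCache` and the optional `seed` parameter are assumptions added for illustration; seeding makes runs reproducible but does not make the policy any less arbitrary about what it evicts.

```python
import random


class RandomCache:
    """Fixed-capacity cache that evicts a randomly chosen key (illustrative sketch)."""

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = {}
        self._rng = random.Random(seed)  # private generator, optionally seeded

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.capacity:
            # Copying the keys to a list is linear time; acceptable for a sketch.
            victim = self._rng.choice(list(self._data))
            del self._data[victim]
        self._data[key] = value

    def __len__(self):
        return len(self._data)
```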
Time-To-Live (TTL) Expiration
Time-To-Live (TTL) Expiration is not a capacity-driven policy but a time-based one. Each cache entry has a timestamp and is evicted once a predefined lifespan expires.
- Mechanism: A background process or a check on access scans for entries whose (creation_time + TTL) < current_time.
- Use Case: Essential for ensuring context freshness. Critical in multi-agent systems where cached state (e.g., sensor data, market prices) becomes stale and misleading after a certain period.
- Drawback: Does not directly address capacity limits. Often used in conjunction with another policy (e.g., LRU-TTL) where entries are evicted due to expiry or capacity pressure.
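
The expiry check `creation_time + TTL < current_time` can be applied both lazily on access and in a periodic sweep. The sketch below shows both; the class name `TTLCache`, the `purge_expired` method, and the injectable `clock` parameter are hypothetical choices made so the behavior is easy to test without waiting.

```python
import time


class TTLCache:
    """Cache whose entries expire a fixed time after creation (illustrative sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock  # assumption: injectable so tests can control time
        self._data = {}      # key -> (value, creation_time)

    def put(self, key, value):
        self._data[key] = (value, self._clock())

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, created = item
        if created + self.ttl < self._clock():
            del self._data[key]  # expired: evict lazily on access
            return default
        return value

    def purge_expired(self):
        """Sweep that a background task could call; returns the number of entries removed."""
        now = self._clock()
        expired = [k for k, (_, created) in self._data.items() if created + self.ttl < now]
        for key in expired:
            del self._data[key]
        return len(expired)
```

Note that nothing here bounds the number of entries, which is why TTL is usually paired with a capacity-driven policy.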
Adaptive & Hybrid Policies
Modern systems often use adaptive or hybrid policies that combine strategies or adjust behavior based on runtime workload characteristics.
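- Examples:
  - LRU-K: Evicts the entry whose K-th most recent access lies furthest in the past. Because an entry needs K accesses to be considered hot, one-hit wonders cannot displace established entries the way they can in a standard LRU cache.
  - 2Q (Two Queues): Uses a FIFO queue for items accessed once and a main LRU queue for items accessed again, improving hit rates when many items are touched only once.
  - ARC (Adaptive Replacement Cache): Dynamically balances recency and frequency by keeping one list of entries seen once recently and one list of entries seen at least twice, plus "ghost" lists of recently evicted keys. Hits in the ghost lists shift capacity toward whichever list would have served them.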
- Use Case: Complex, long-running autonomous agents with shifting context needs, where no single static policy is optimal.
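
To make the hybrid idea concrete, here is a sketch in the spirit of the simplified 2Q scheme: new keys enter a small probationary FIFO queue, and only a second access promotes them into the main LRU queue. The class name `TwoQueueCache`, the separate size parameters, and the rule that only reads (not overwrites) trigger promotion are all assumptions of this sketch, not details taken from a specific implementation.

```python
from collections import OrderedDict


class TwoQueueCache:
    """Two-queue cache: FIFO for keys seen once, LRU for keys seen again (illustrative sketch)."""

    def __init__(self, probation_size, main_size):
        self.probation_size = probation_size
        self.main_size = main_size
        self._probation = OrderedDict()  # keys seen once, in insertion (FIFO) order
        self._main = OrderedDict()       # keys seen again, least recently used first

    def get(self, key, default=None):
        if key in self._main:
            self._main.move_to_end(key)
            return self._main[key]
        if key in self._probation:
            value = self._probation.pop(key)
            self._main[key] = value      # second access: promote to the main queue
            if len(self._main) > self.main_size:
                self._main.popitem(last=False)
            return value
        return default

    def put(self, key, value):
        if key in self._main:
            self._main[key] = value
            self._main.move_to_end(key)
        elif key in self._probation:
            self._probation[key] = value  # assumption: overwrite keeps the FIFO position
        else:
            self._probation[key] = value
            if len(self._probation) > self.probation_size:
                self._probation.popitem(last=False)
```

The benefit shows up during a scan: a burst of keys that are each written once churns only the probationary queue, while entries already promoted to the main queue survive.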
How Cache Eviction Works in Transformer Inference
Cache eviction is a critical memory management process in transformer-based language models, determining which cached data is removed to make room for new computations when capacity limits are reached.
Cache eviction is the algorithmic process of selectively removing entries from a transformer's KV Cache or other intermediate state caches to manage finite memory resources during inference. It is triggered when the cache reaches its allocated capacity, such as when a sequence length exceeds the model's context window. The eviction policy—like Least Recently Used (LRU) or First-In-First-Out (FIFO)—defines the rule for which cached key-value vectors are discarded first, directly impacting generation speed and output coherence.
Efficient eviction is essential for long-context or streaming applications like StreamingLLM. Poor policies can force expensive recomputation or discard semantically critical tokens, degrading performance. Strategies often combine eviction with attention sink preservation and sliding window techniques to maintain stable, continuous generation. This low-level memory management is a core component of inference optimization, balancing latency, throughput, and hardware constraints.
Frequently Asked Questions
Cache eviction is a critical memory management process in AI systems, determining which data is removed from a cache when capacity is reached. This FAQ addresses its core mechanisms, policies, and role in optimizing agentic workflows.
Cache eviction is the automated process of removing entries from a cache—such as a KV Cache or conversational history buffer—to free up memory when the cache reaches its capacity limit. It works by continuously monitoring cache usage and applying a predefined eviction policy (e.g., LRU, FIFO) to select which items to remove. When a new item needs to be inserted into a full cache, the eviction algorithm identifies the least valuable entry according to its policy, deletes it, and then writes the new data in its place. This mechanism is fundamental to managing the context window in language models, ensuring efficient memory use for long interactions or document processing.
Related Terms
Cache eviction is a critical component of context management, determining which data is removed from a finite memory store. These related concepts define the policies, mechanisms, and systems that govern this process.
KV Cache (Key-Value Cache)
The KV Cache is a transformer optimization that stores pre-computed key and value tensors for previously generated tokens during autoregressive inference. This cache eliminates redundant computation for the attention mechanism, dramatically speeding up sequential token generation. Its size is directly proportional to the context length, making efficient cache eviction essential for managing long sequences without exhausting GPU memory.
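- Purpose: Accelerates generation by caching the key and value projections that each attention layer computes for earlier tokens, so they are not recomputed at every step.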
- Memory Footprint: Grows with batch size, sequence length, and model dimensions.
- Eviction Trigger: When the cache exceeds available memory or a predefined token limit, an eviction policy must decide which cached tensors to discard.
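
The memory footprint can be estimated with a short calculation: one key tensor and one value tensor per layer, each holding `num_kv_heads * head_dim` values per token. The function name and the example dimensions below are illustrative assumptions, not figures for any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes (sketch; ignores allocator and padding overhead)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim  # factor 2: keys and values
    return per_token * seq_len * batch_size * bytes_per_element


# Illustrative dimensions: 32 layers, 32 KV heads of size 128, 4096 tokens, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 1024**3, "GiB")  # 2.0 GiB
```

Because the estimate is linear in both sequence length and batch size, doubling either one doubles the memory the cache needs.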
Context Eviction Policy
A context eviction policy is a rule set that selects which entries to remove from a cache when capacity is reached. It is the algorithmic core of the cache eviction process. Most policies are deterministic, although random replacement is a deliberate exception. Common policies include:
- Least Recently Used (LRU): Discards the item that hasn't been accessed for the longest time. Effective for conversational agents where recent turns are most relevant.
- First-In-First-Out (FIFO): Removes the oldest item in the cache, regardless of usage. Simple but can evict critical, frequently accessed context.
- Least Frequently Used (LFU): Evicts the item with the fewest accesses. Can be problematic if early, important context is never re-accessed.
- Random Replacement: Randomly selects an item for eviction. Low overhead but unpredictable.
The choice of policy directly impacts an agent's ability to retain pertinent operational state.
Context Window Saturation
Context window saturation occurs when a model's fixed token limit is fully utilized, preventing the injection of new tokens without first removing existing ones. This is the primary condition that triggers cache eviction. In agentic workflows, saturation forces a trade-off:
- Information Loss: Evicting context may discard critical instructions, historical facts, or intermediate reasoning steps.
- Performance Degradation: Models can "forget" early context, leading to incoherent or contradictory outputs.
- Management Overhead: Requires systems to monitor token counts and execute eviction or compression routines automatically.
Saturation is an engineering constraint that defines the working memory boundary for any single inference call.
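
One way to handle saturation at the conversation level is to pin the first turn and drop the oldest remaining turns until the history fits the budget. The sketch below assumes the first turn is a system prompt worth keeping, and it counts tokens by splitting on whitespace, which is only a placeholder for a real tokenizer. The function name `trim_to_budget` is hypothetical.

```python
def trim_to_budget(turns, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the first turn plus as many of the most recent turns as fit (illustrative sketch)."""
    if not turns:
        return []
    pinned, rest = turns[0], turns[1:]
    budget = max_tokens - count_tokens(pinned)
    kept = []
    for turn in reversed(rest):      # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                    # everything older than this turn is evicted
        kept.append(turn)
        budget -= cost
    return [pinned] + kept[::-1]     # restore chronological order


history = ["you are helpful", "a b c", "d e", "f g h i"]
print(trim_to_budget(history, max_tokens=10))
# ['you are helpful', 'd e', 'f g h i']
```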
StreamingLLM
StreamingLLM is a framework that enables language models trained with a finite context window to process infinite-length text streams without fine-tuning. It achieves this by leveraging attention sinks—the observation that initial tokens receive disproportionately high attention scores—and maintaining a fixed-size sliding window cache of recent tokens.
- Mechanism: Keeps the first few tokens (the attention sink) and the most recent N tokens in the KV Cache, evicting the middle tokens.
- Eviction Implication: Implements a hybrid eviction policy: absolute retention of sink tokens and FIFO eviction for the recent token window.
- Use Case: Essential for long-running chatbots, log processing, and real-time transcription where context far exceeds the native window.
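
The retention rule described above, sink tokens plus a recent window, reduces to choosing which token positions stay in the cache. The following sketch shows only that selection step; it is not the StreamingLLM codebase, and the function and parameter names are hypothetical.

```python
def streaming_keep_positions(seq_len, num_sinks, window):
    """Return the token positions retained: the first num_sinks and the last `window` (sketch)."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))  # nothing to evict yet
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent            # the middle positions are evicted


print(streaming_keep_positions(seq_len=10, num_sinks=2, window=3))
# [0, 1, 7, 8, 9]
```

The cache size is capped at `num_sinks + window` entries regardless of how long the stream runs.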
Context Caching
Context caching is the broader strategy of storing and reusing previously computed context—such as KV Cache states, summarized dialogue history, or retrieved document chunks—across multiple inference calls or user sessions. Cache eviction is the cleanup mechanism for this strategy.
- Objective: Reduce latency and computational cost by avoiding redundant processing of static or repeated context.
- Examples: Caching system prompts, frequently accessed knowledge base snippets, or the state of a multi-step planning process.
- Challenge: Requires invalidation logic (eviction) to ensure cached data does not become stale or incorrectly applied to new queries.
Sliding Window Attention
Sliding window attention is an efficient attention mechanism where a transformer model only attends to a fixed-size window of the most recent tokens. This creates a built-in, implicit cache eviction policy: as new tokens are added, tokens that fall outside the window are automatically excluded from attention calculations.
- Memory Efficiency: Provides a constant, predictable memory cost (O(window_size)) for sequences of arbitrary length.
- Eviction Behavior: Implements a strict FIFO eviction from the attention scope; older tokens are "forgotten" by the model's attention heads.
- Trade-off: While efficient, it limits the model's ability to draw connections between distant parts of a very long sequence.
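
The implicit eviction can be seen in the attention mask itself. This sketch builds a causal sliding window mask as nested Python lists, purely for illustration; real implementations operate on tensors, and the function name is hypothetical.

```python
def sliding_window_mask(seq_len, window_size):
    """mask[i][j] is True when query position i may attend to key position j (sketch)."""
    return [
        [(j <= i) and (i - j < window_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in sliding_window_mask(seq_len=4, window_size=2):
    print(["x" if allowed else "." for allowed in row])
# ['x', '.', '.', '.']
# ['x', 'x', '.', '.']
# ['.', 'x', 'x', '.']
# ['.', '.', 'x', 'x']
```

Each row has at most `window_size` allowed positions, so keys older than the window never need to stay in the cache.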

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.