Cache coherence is a hardware-level protocol that maintains a single, consistent view of shared memory data across all processor caches in a multiprocessor system. It guarantees that any read of a memory location returns the most recently written value, regardless of which processor performed the write. This is essential for correct parallel program execution, preventing threads from operating on stale or conflicting data copies. Protocols like MESI (Modified, Exclusive, Shared, Invalid) use state tracking and inter-processor communication to enforce these rules.
Cache Coherence

What is Cache Coherence?
Cache coherence is a fundamental property of shared-memory multiprocessor systems, including modern NPUs and GPUs, ensuring data consistency across distributed caches.
In NPU acceleration, cache coherence protocols manage data shared between multiple processing cores or tiles. Without coherence, parallel kernels that share data could read stale values and produce non-deterministic, incorrect results. The protocol's overhead, whether from snooping traffic or directory-based messaging, is a central design trade-off that affects both performance and scalability. Efficient coherence matters most for algorithms that use shared memory for inter-thread communication. It is closely related to, but distinct from, the system's memory consistency model, which defines the visible ordering of memory operations.
Key Properties of a Coherent System
Cache coherence is a fundamental correctness property in shared-memory multiprocessor systems. It ensures that all processors observe a consistent view of memory, preventing data corruption and logical errors. The following principles define the mechanisms and guarantees required for a system to be considered coherent.
Write Propagation
This property guarantees that a write operation to a memory location by one processor will eventually become visible to all other processors. It prevents processors from reading stale data from their private caches after an update has occurred elsewhere in the system. Coherence protocols implement this through invalidation or update messages that are broadcast or sent to sharers of the cache line.
- Invalidation-based protocols: Mark other copies as invalid, forcing a miss on the next read.
- Update-based protocols: Propagate the new data value directly to all other caches holding the line.
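The difference between the two propagation styles can be sketched with a toy model. This is a minimal illustration, not a real protocol: `ToyCache`, `write`, `read`, and `demo` are hypothetical names, caches are plain dictionaries, and stores are assumed to write through to memory.

```python
from dataclasses import dataclass, field

@dataclass
class ToyCache:
    """A private cache modelled as address -> value; a missing key means 'invalid'."""
    lines: dict = field(default_factory=dict)

def write(memory, caches, writer, addr, value, policy):
    """Perform a write-through store and propagate it to the other sharers."""
    memory[addr] = value
    caches[writer].lines[addr] = value
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache.lines:
            continue  # only other caches holding the line are affected
        if policy == "invalidate":
            del cache.lines[addr]       # next read misses and refetches
        elif policy == "update":
            cache.lines[addr] = value   # sharer receives the new value directly
        else:
            raise ValueError(f"unknown policy: {policy}")

def read(memory, caches, reader, addr):
    """Return (value, hit) for a read, filling the cache on a miss."""
    cache = caches[reader]
    if addr in cache.lines:
        return cache.lines[addr], True
    cache.lines[addr] = memory[addr]
    return cache.lines[addr], False

def demo(policy):
    """Two cores share a line, core 0 writes, then core 1 reads."""
    memory = {0x40: 1}
    caches = [ToyCache(), ToyCache()]
    read(memory, caches, 0, 0x40)
    read(memory, caches, 1, 0x40)            # both caches now hold the line
    write(memory, caches, 0, 0x40, 2, policy)
    return read(memory, caches, 1, 0x40)     # what does the other core see?

print(demo("invalidate"))  # (2, False): new value, obtained via a miss
print(demo("update"))      # (2, True): new value, obtained via a hit
```

In both cases the reader sees the new value; the policies differ only in whether the sharer pays a miss later (invalidate) or receives data it may never use (update).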
Write Serialization
Also known as write ordering, this property ensures that all processors observe writes to the same memory location in the same sequential order. If two processors write to the same address, the system must define a global order for those writes. All subsequent reads by any processor must reflect that order. This is critical for implementing synchronization primitives like locks and barriers. The coherence protocol, often in conjunction with the memory consistency model, establishes this global order, typically by serializing writes through a single point of coordination, such as the home directory in a directory-based protocol.
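As a rough sketch of serialization through a single point of coordination, the toy model below lets a hypothetical `HomeNode` assign the order of writes to one address, while each `Observer` may lag behind but can never move backwards in that order. The class names and the `lag` parameter are assumptions made for illustration only.

```python
class HomeNode:
    """Hypothetical serialization point: orders all writes to one address."""
    def __init__(self, initial):
        self.log = [initial]           # globally agreed write order

    def write(self, value):
        self.log.append(value)
        return len(self.log) - 1       # sequence number of this write

class Observer:
    """A core that may see writes late, but only ever moves forward in the log."""
    def __init__(self, home):
        self.home = home
        self.seen = 0                  # index of the newest write observed so far
        self.history = [0]             # indices observed, in observation order

    def read(self, lag=0):
        newest = len(self.home.log) - 1
        # Never observe a write older than one already seen.
        self.seen = max(self.seen, newest - lag)
        self.history.append(self.seen)
        return self.home.log[self.seen]

home = HomeNode(initial=0)
a, b = Observer(home), Observer(home)
home.write(10)             # issued by one core
home.write(20)             # issued by another; the home node decides 10 precedes 20
first_a = a.read()         # prompt observer sees 20
first_b = b.read(lag=1)    # lagging observer still sees 10
second_b = b.read()        # catches up to 20 and cannot return to 10
```

Both observers agree that 10 was written before 20, even though they observe the writes at different times.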
Coherence States (MSI/MESI/MOESI)
Cache lines are tracked using finite state machines. Common protocols define states like:
- Modified (M): The cache holds the only valid copy and the data is dirty (different from main memory).
- Exclusive (E): The cache holds the only valid copy, but it is clean (matches main memory).
- Shared (S): The cache holds a valid, clean copy, but other caches may also hold it.
- Invalid (I): The data in the cache line is stale and cannot be used.
- Owned (O): (MOESI) The cache holds a dirty copy and is responsible for supplying it to other caches, but other caches may hold it in Shared state.
Transitions between these states are triggered by local processor operations (read, write) and coherence messages from other caches or the directory.
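The four MESI states and their main transitions can be written as a small function. This is a simplified sketch: the event names (`local_read`, `local_write`, `remote_read`, `remote_write`) are illustrative labels rather than standard signal names, and side effects such as write-backs and data transfers are only noted in comments.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

def next_state(state, event, others_have_copy=False):
    """Return the next MESI state of one cache line for a given event."""
    if event == "local_read":
        if state is State.I:
            # A read miss fills the line as Shared or Exclusive.
            return State.S if others_have_copy else State.E
        return state                   # hits in M, E, S keep their state
    if event == "local_write":
        return State.M                 # gaining write permission always ends in M
    if event == "remote_read":
        if state in (State.M, State.E):
            return State.S             # an M line also supplies its dirty data
        return state                   # S stays S, I stays I
    if event == "remote_write":
        return State.I                 # another cache took ownership
    raise ValueError(f"unknown event: {event}")

# Trace one line through a typical lifetime.
s = State.I
trace = [s]
for ev in ["local_read", "local_write", "remote_read", "remote_write"]:
    s = next_state(s, ev)
    trace.append(s)
print([t.name for t in trace])  # ['I', 'E', 'M', 'S', 'I']
```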
Snooping vs. Directory-Based Protocols
These are the two primary architectural approaches to implementing coherence.
- Snooping (Bus-Based): All caches monitor (snoop) a shared broadcast interconnect (e.g., a bus) for transactions. When a write is seen, caches invalidate or update their local copies. This is simple but does not scale well to many cores due to broadcast traffic.
- Directory-Based: A centralized or distributed directory tracks which caches hold copies of each memory block. On a write, point-to-point messages are sent only to the caches that actually hold the data (the sharers). This scales to large core counts (dozens to hundreds) used in modern servers and NPUs, as it avoids broadcast storms.
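A back-of-the-envelope model shows why the two approaches scale differently. The functions below are hypothetical and deliberately crude: they count only how many caches must process an invalidation for a single write, and ignore acknowledgements, data transfers, and directory storage cost.

```python
def snooping_messages(num_cores):
    """Broadcast: every other cache must observe and check the request."""
    return num_cores - 1

def directory_messages(sharers, writer):
    """Directory: one request to the directory plus one invalidation per other sharer."""
    targets = set(sharers) - {writer}
    return 1 + len(targets)

# One write on a 64-core system where only cores 3 and 17 hold the line.
print(snooping_messages(64))                            # 63
print(directory_messages(sharers={3, 17}, writer=3))    # 2
```

The snooping cost grows with the core count regardless of how many caches hold the line, while the directory cost grows only with the number of actual sharers.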
False Sharing
A major performance pitfall where two unrelated variables reside on the same cache line. Because coherence is tracked at cache-line granularity, writes by different processors to these different variables are treated as conflicting writes to the same line, causing unnecessary invalidation traffic and cache line ping-pong. This severely degrades performance even though the processors are not logically sharing data. Mitigation involves padding data structures or aligning variables to cache line boundaries to ensure they occupy separate lines.
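The effect of padding on layout can be checked with Python's `ctypes`, which reports field offsets for C-style structs. This sketch assumes a 64-byte cache line (common, but not universal) and a struct that starts on a line boundary; the struct and field names are made up for illustration.

```python
import ctypes

CACHE_LINE = 64  # bytes; an assumption, check the target hardware

class Packed(ctypes.Structure):
    """Two counters side by side: both land on the same cache line."""
    _fields_ = [("counter_a", ctypes.c_uint64),
                ("counter_b", ctypes.c_uint64)]

class Padded(ctypes.Structure):
    """Padding pushes the second counter onto the next cache line."""
    _fields_ = [("counter_a", ctypes.c_uint64),
                ("_pad", ctypes.c_uint8 * (CACHE_LINE - 8)),
                ("counter_b", ctypes.c_uint64)]

def line_of(struct, field_name):
    """Cache line index of a field, relative to the start of the struct."""
    return getattr(struct, field_name).offset // CACHE_LINE

print(line_of(Packed, "counter_a"), line_of(Packed, "counter_b"))  # 0 0
print(line_of(Padded, "counter_a"), line_of(Padded, "counter_b"))  # 0 1
```

With the padded layout, two threads that each update their own counter no longer invalidate each other's line, at the cost of extra memory per counter.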
Memory Consistency Model Interaction
Cache coherence and memory consistency are separate but related concepts. Coherence defines the behavior of reads and writes to a single memory location. Consistency (e.g., Sequential Consistency, Total Store Order, Release Consistency) defines the observable order of reads and writes to different memory locations across threads.
A system can be coherent but not sequentially consistent. For example, writes to different addresses by one processor may be observed in different orders by other processors, even though each individual location's history is coherent. The coherence protocol must provide the necessary guarantees (like write serialization) that the chosen consistency model relies upon.
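The distinction can be illustrated with a small "message passing" litmus test: one processor writes `data` and then `flag`, while another reads `flag` and then `data`. The sketch below enumerates every possible schedule under two toy models; the function names and the way reordering is modelled (the writer's two stores may become visible in either order) are simplifying assumptions, not a description of any specific architecture.

```python
from itertools import permutations

def interleavings(a, b):
    """Yield every merge of tuples a and b that preserves each one's internal order."""
    if not a:
        yield tuple(b)
        return
    if not b:
        yield tuple(a)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest

def outcomes(allow_write_reordering):
    """Return the set of (flag, data) pairs the reader can observe."""
    writer = (("w", "data", 1), ("w", "flag", 1))   # program order on P0
    reader = (("r", "flag"), ("r", "data"))          # program order on P1
    if allow_write_reordering:
        writer_orders = set(permutations(writer))    # stores may become visible in either order
    else:
        writer_orders = {writer}                     # sequentially consistent
    results = set()
    for w in writer_orders:
        for schedule in interleavings(w, reader):
            mem, seen = {"data": 0, "flag": 0}, {}
            for op in schedule:
                if op[0] == "w":
                    mem[op[1]] = op[2]
                else:
                    seen[op[1]] = mem[op[1]]
            results.add((seen["flag"], seen["data"]))
    return results

print(sorted(outcomes(allow_write_reordering=False)))  # [(0, 0), (0, 1), (1, 1)]
print(sorted(outcomes(allow_write_reordering=True)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Each location has a single, coherent history in both models. Only the relaxed model allows the reader to see `flag == 1` while `data == 0`, which is exactly the cross-location ordering that a consistency model, not coherence, rules in or out.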
Cache Coherence Protocol Comparison
A comparison of fundamental hardware-based cache coherence protocols, detailing their operational mechanisms, performance characteristics, and implementation trade-offs for multi-core NPU and CPU systems.
| Protocol Feature / Metric | Snooping (Bus-Based) | Directory-Based | Token-Based |
|---|---|---|---|
| Primary Coordination Mechanism | Broadcast & Snoop on shared bus/interconnect | Point-to-point messages via centralized/distributed directory | Circulation of ownership tokens |
| Scalability (Core Count) | Poor (typically < 32 cores) | Excellent (100s to 1000s+ cores) | Good (10s to 100s of cores) |
| Average Latency (Shared Read) | Low (if bus uncontended) | Medium (directory lookup overhead) | Variable (depends on token location) |
| Bandwidth Consumption | High (broadcasts on every write) | Low (point-to-point invalidations/updates) | Medium (token passing messages) |
| Write Serialization Enforcement | Bus arbitration provides total order | Directory acts as serialization point | Token possession guarantees serialization |
| Silicon Area Overhead | Low | Medium to High (directory storage) | Low to Medium (token state per block) |
| Typical Implementation | MESI, MOESI protocols | AMD Infinity Fabric, Intel QPI | Token Coherence, COMA architectures |
| Handles False Sharing Efficiently | | | |
Frequently Asked Questions
Cache coherence is a fundamental property of shared-memory multiprocessor systems, ensuring data consistency across private caches. This FAQ addresses core concepts, protocols, and its critical role in parallel computing and hardware acceleration.
Cache coherence is a property of a shared-memory multiprocessor system that guarantees all processor caches have a consistent view of shared data, meaning every read of a memory location returns the most recently written value to that location. It is necessary because without it, multiple cached copies of the same memory block could hold different values, leading to incorrect program execution and breaking the guarantees that the memory consistency model relies on. This inconsistency arises when one processor writes to its local cache copy, making that copy dirty, while other processors retain stale copies of the old value. Coherence protocols actively manage these states to maintain the single-writer, multiple-reader invariant: at any moment a line has either one writer or any number of readers, never both.
Related Terms
Cache coherence is a fundamental property enabling correct parallel execution. These related concepts define the synchronization, communication, and performance models that govern multi-processor and multi-threaded systems.
Memory Consistency Model
A memory consistency model defines the formal rules for the order in which memory operations (loads and stores) from different threads become visible to each other in a shared memory system. It provides the programmer's contract with the hardware. Cache coherence is not itself a consistency model: it guarantees a consistent view of each individual memory location, and the consistency model builds on that guarantee to define ordering across different locations. Strong models such as sequential consistency are the easiest to reason about, while weaker models (e.g., release consistency) allow for higher performance but require explicit programmer synchronization.
Atomic Operations
Atomic operations are indivisible read-modify-write instructions (e.g., fetch-and-add, compare-and-swap) that complete without interruption from other threads. They are a fundamental building block for lock-free and wait-free algorithms. Cache coherence protocols are essential for implementing atomic operations correctly across a multi-processor system, as they ensure the atomic instruction's effect is globally visible and ordered relative to other operations on the same memory location.
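A compare-and-swap retry loop can be sketched in Python by simulating the hardware guarantee. In this toy model, `AtomicCell` is a hypothetical class and its `threading.Lock` stands in for a core holding exclusive ownership of the cache line for the duration of the operation; real hardware provides this without a software lock.

```python
import threading

class AtomicCell:
    """Toy atomic integer; the lock models exclusive ownership of the cache line."""
    def __init__(self, value=0):
        self._value = value
        self._line_owner = threading.Lock()

    def load(self):
        with self._line_owner:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`."""
        with self._line_owner:
            if self._value == expected:
                self._value = new
                return True
            return False

def fetch_and_add(cell, delta):
    """Build fetch-and-add from CAS: retry until no other thread intervened."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta):
            return old

def run(threads=4, increments=1000):
    cell = AtomicCell()

    def worker():
        for _ in range(increments):
            fetch_and_add(cell, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()

print(run())  # 4000: no increment is lost
```

A failed CAS means another thread modified the value between the load and the swap, so the loop simply retries with the fresh value.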
Memory Barrier (Fence)
A memory barrier (or memory fence) is a low-level instruction that enforces ordering constraints on memory operations issued before and after the barrier. In systems with weak memory consistency models, barriers are required to guarantee that shared writes become visible in the intended order and to ensure correct synchronization. On cache-coherent hardware, a barrier typically works by stalling until earlier operations have completed, for example by draining the store buffer or waiting for outstanding invalidations to be acknowledged, rather than by flushing cache lines; the coherence protocol itself keeps the cached copies consistent.
Non-Uniform Memory Access (NUMA)
NUMA is a multiprocessor architecture where memory access time depends on the memory location's physical proximity to the processor. Each processor has local, fast memory, but can also access slower remote memory attached to other processors. Cache coherence in NUMA systems (CC-NUMA) is more complex and costly, as maintaining consistency for remote cache lines requires communication across an interconnect, making data placement and migration critical for performance.
Data Race
A data race is a critical concurrency bug that occurs when two or more threads in a process access the same memory location concurrently, at least one access is a write, and the accesses are not ordered by proper synchronization (e.g., locks, atomics). While cache coherence ensures all processors see a consistent final value, it does not prevent races; it merely defines the order in which conflicting writes become visible. Races lead to undefined, non-deterministic program behavior.
MESI Protocol
The MESI protocol is a specific, widely implemented cache coherence protocol that defines four states for each cache line: Modified (M), Exclusive (E), Shared (S), and Invalid (I). It uses these states and a snooping or directory-based mechanism to track ownership and manage updates. For example, a write to a line in the Shared state requires invalidating all other copies before the writer's copy transitions to Modified. The Exclusive state reduces bus traffic: a cache that holds the only copy of a clean line can write to it, moving silently to Modified, without first sending an invalidation.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.