Glossary

NUMA (Non-Uniform Memory Access)

NUMA is a computer memory design for multiprocessors where memory access time depends on the memory location relative to the processor, with local memory being faster than non-local (remote) memory.



PARALLEL COMPUTING ARCHITECTURE

What is NUMA (Non-Uniform Memory Access)?

NUMA is a foundational computer architecture for modern multi-socket servers and high-performance computing systems, directly impacting how parallel workloads, including those for AI and machine learning, are scheduled and optimized.

NUMA (Non-Uniform Memory Access) is a shared-memory multiprocessor architecture where memory access latency depends on the memory location relative to the requesting processor. In a NUMA system, each processor or group of processors (a NUMA node) has its own local memory, which it can access with low latency. Accessing memory attached to a remote processor (remote memory) incurs higher latency and delivers lower bandwidth. This design contrasts with Uniform Memory Access (UMA) architectures, where all processors share a single, centralized memory pool with equal access times.

For AI and high-performance computing, effective NUMA-aware scheduling is critical. The operating system and runtime environments use NUMA affinity policies to bind processes and their memory allocations to specific nodes, minimizing costly remote accesses. Performance degrades severely if a thread running on one node frequently accesses data allocated on another, a condition known as NUMA thrashing. Runtimes and frameworks for parallelism and scheduling must manage data placement and thread pinning explicitly to achieve optimal throughput on NUMA systems, especially for the memory-bound workloads common in large-scale model training and inference.

PARALLEL COMPUTING

Key Characteristics of NUMA Architecture

NUMA (Non-Uniform Memory Access) is a multiprocessor memory architecture where access time depends on the memory location relative to the processor. This design is fundamental for scaling modern multi-socket servers and high-core-count systems.

Memory Access Latency Hierarchy

The defining characteristic of NUMA is a memory latency hierarchy. Each processor or group of processors (a NUMA node) has its own local memory. Access to this local memory is fast. Access to memory attached to another processor (remote memory) is slower, traversing an interconnect like Intel's Ultra Path Interconnect (UPI) or AMD's Infinity Fabric. This creates a non-uniform access pattern where performance depends on data locality.

NUMA Node Structure

A NUMA node is the fundamental building block, typically comprising:

One or more CPU cores (a socket or a core complex).
A local bank of DRAM (the node's memory).
A local memory controller.
Links to other NUMA nodes via the interconnect.

The operating system enumerates these nodes, and software can query topology via interfaces like libnuma on Linux. Efficient programming requires thread and memory affinity to keep processes and their data on the same node.
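As a rough illustration of how software might enumerate nodes without libnuma, the sketch below reads the per-node cpulist files that Linux exposes under /sys/devices/system/node. The function names parse_cpulist and read_node_cpus are hypothetical helpers introduced here, and the code assumes the usual sysfs list format such as "0-3,8-11". On systems without that directory it simply returns an empty mapping.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs CPU list such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_node_cpus(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]}, or {} when the sysfs tree is absent."""
    root = Path(sysfs_root)
    nodes = {}
    if not root.is_dir():
        return nodes
    for node_dir in sorted(root.glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            nodes[node_id] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    # Output depends on the machine; a single-socket laptop usually shows one node.
    print(read_node_cpus())
```

A scheduler or thread pool could use the resulting mapping to decide which cores belong together before pinning any threads.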

Interconnect and Coherency Protocol

NUMA nodes communicate via a high-speed interconnect. This hardware also runs a cache coherency protocol (e.g., MESI, MOESI) across the entire system. This protocol ensures that all processor caches have a consistent view of shared memory, even when data is migrated or replicated. The overhead of maintaining coherency across the interconnect for remote accesses is a primary source of NUMA penalty. Protocols are directory-based or snooping-based to track sharing states.
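To make the coherency states concrete, here is a deliberately simplified model of MESI for a single cache line shared by several caches. It is a teaching sketch, not a description of any vendor's protocol: it ignores directories, snoop filters, write-back timing, and interconnect messages, and the function name mesi_access is invented for this example. On a NUMA machine, each invalidation or state downgrade that crosses sockets is the kind of traffic that contributes to the remote-access penalty.

```python
def mesi_access(states, cache, op):
    """Apply a read ("r") or write ("w") by `cache` to one cache line.

    `states` holds one MESI letter per cache; a new list is returned.
    """
    new = list(states)
    others = [i for i in range(len(states)) if i != cache]
    if op == "w":
        # A write needs exclusive ownership: every other copy is invalidated.
        for i in others:
            new[i] = "I"
        new[cache] = "M"
    elif op == "r":
        if states[cache] == "I":
            # Read miss: any holder in M, E, or S ends up in S
            # (a holder in M would write its data back first).
            shared = False
            for i in others:
                if states[i] in ("M", "E", "S"):
                    new[i] = "S"
                    shared = True
            new[cache] = "S" if shared else "E"
        # Read hits in M, E, or S leave every state unchanged.
    else:
        raise ValueError("op must be 'r' or 'w'")
    return new


line = ["I", "I"]                # two caches, line not cached anywhere
line = mesi_access(line, 0, "r")  # cache 0 reads alone
line = mesi_access(line, 1, "r")  # cache 1 reads the same line
line = mesi_access(line, 0, "w")  # cache 0 writes, cache 1 is invalidated
print(line)                       # ['M', 'I']
```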

First-Touch Policy and Memory Allocation

Most operating systems use a first-touch policy for memory allocation. When a thread first writes to a page of memory, that page is physically allocated from the DRAM of the NUMA node where the writing thread is executing. If the initializing thread runs on a different node from the threads that later use the data, the data ends up remote to its consumers. Explicit memory policy interfaces (e.g., the mbind() system call or the numactl command) allow developers to control allocation, such as interleaving pages across nodes for uniform bandwidth.

Performance Impact and Optimization

Poor NUMA awareness can slow memory-bound code considerably, because remote accesses typically take 1.5x to 3x as long as local ones and cross-node bandwidth is lower. Key optimization strategies include:

Thread Pinning: Binding threads to specific cores/nodes.
Data Placement: Allocating data on the node where it will be processed.
NUMA-Aware Algorithms: Designing parallel algorithms to minimize cross-node communication.
Topology Awareness: Using system tools (numactl, lstopo) to understand the hardware layout.

The goal is to maximize accesses to local memory.
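A back-of-the-envelope model helps show why locality matters. The sketch below computes the average memory latency as a weighted mix of local and remote accesses. The 100 ns and 250 ns figures are illustrative assumptions rather than measurements, and the model ignores caches, bandwidth limits, and contention.

```python
def effective_latency_ns(local_ns, remote_ns, remote_fraction):
    """Average access latency when `remote_fraction` of accesses go off-node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Illustrative numbers only: 100 ns local, 250 ns remote, half the accesses remote.
print(effective_latency_ns(100, 250, 0.5))  # 175.0
```

With these assumed numbers, moving from all-local to half-remote accesses raises average latency from 100 ns to 175 ns, which is why the strategies above all aim to reduce the remote fraction.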

CC-NUMA vs. NCC-NUMA

Most modern systems are Cache-Coherent NUMA (CC-NUMA), where hardware maintains coherence across all processor caches, presenting a single shared memory image to software. Historically, Non-Cache-Coherent NUMA (NCC-NUMA) systems existed, requiring explicit software management for coherence, similar to clusters. CC-NUMA is the standard for commercial servers, workstations, and many high-performance computing systems, providing an easier programming model at the cost of complex interconnect hardware.

PARALLELISM AND SCHEDULING

How NUMA Architecture Works

NUMA (Non-Uniform Memory Access) is a shared-memory multiprocessor architecture where memory access latency depends on the physical location of the memory relative to the requesting processor.

In a NUMA system, each processor or group of processors (a NUMA node) has its own local memory bank. Access to this local memory is fast. Access to memory attached to a remote node (remote memory) is slower due to the interconnects traversed. This contrasts with Uniform Memory Access (UMA) architectures, where all memory is equidistant. The primary goal is to scale memory bandwidth with processor count, avoiding the bottleneck of a single, shared memory bus.

Efficient programming requires NUMA-aware scheduling and data placement. The operating system and applications should allocate memory on the node where the computation occurs to minimize remote accesses. This is managed via first-touch policies and explicit controls such as the numactl command. Performance degrades significantly if threads frequently access remote memory, a condition known as NUMA thrashing. This architecture is foundational in modern multi-socket servers and high-core-count CPUs.

MEMORY ARCHITECTURE

NUMA vs. UMA: Architectural Comparison

A comparison of Non-Uniform Memory Access (NUMA) and Uniform Memory Access (UMA) architectures, detailing their fundamental design principles, performance characteristics, and suitability for different parallel computing workloads.

Architectural Feature	NUMA (Non-Uniform Memory Access)	UMA (Uniform Memory Access)
Core Design Principle	Distributed, shared memory organized into nodes. Each processor has local memory; remote memory access is possible but slower.	Centralized, symmetric shared memory. All processors access a single, unified memory pool via a common system bus or crossbar.
Memory Access Latency	Non-uniform. Local memory access is fast (~100 ns). Remote memory access incurs higher latency due to interconnect traversal (~200-300 ns).	Uniform. All memory accesses have statistically identical latency, determined by the shared bus/interconnect speed (~150 ns).
Scalability	High. Scales efficiently to many processors (dozens to hundreds) by adding nodes, minimizing bus contention.	Low. Limited by bus/interconnect bandwidth. Performance degrades significantly beyond ~4-8 processors due to contention.
Typical Interconnect	High-speed point-to-point links (e.g., HyperTransport, QuickPath Interconnect) forming a network.	Shared system bus (e.g., Front-Side Bus) or a crossbar switch.
Cache Coherence Protocol	Directory-based or Snooping protocols adapted for distributed memory (e.g., MOESI). More complex due to non-uniform latency.	Snooping protocol (e.g., MESI) over the shared bus. Simpler but generates more broadcast traffic.
System Cost & Complexity	Higher. Requires sophisticated memory controllers per node and complex coherence logic. Higher design and validation cost.	Lower. Simpler, centralized memory controller and coherence logic. Lower design cost.
Optimal Workload Type	Workloads with strong data locality, where threads can be scheduled on the node holding their required data (e.g., large-scale simulations, databases).	Small-scale symmetric multiprocessing (SMP) systems running general-purpose workloads or workloads with poor data locality.
OS & Software Awareness	Requires NUMA-aware OS (for memory allocation policies) and applications (for thread/data placement) to avoid performance pitfalls.	Transparent to software. No special placement policies required; the OS treats all memory as equal.
Example Architectures	AMD EPYC, Intel Xeon (Nehalem and later, including Xeon Scalable), ARM-based server SoCs, modern multi-socket servers.	Traditional single-bus SMP systems, older multi-core CPUs (e.g., Intel Core 2 Quad), some embedded multicore processors.

PARALLELISM AND SCHEDULING

NUMA Optimization Techniques

Techniques to mitigate the performance penalty of Non-Uniform Memory Access (NUMA) in multi-socket servers by strategically placing data and threads close to the processors that need them.

Thread and Memory Affinity

The practice of pinning software threads to specific CPU cores and allocating memory from the local NUMA node of those cores. This ensures memory accesses are local, minimizing latency.

Tools: numactl (Linux), SetThreadAffinityMask (Windows).
Goal: Prevent the OS scheduler from migrating threads between sockets, which would turn accesses to memory already allocated on the original node into remote accesses.
Example: Binding a database process to cores on NUMA node 0 and allocating its memory from node 0.
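From inside a program, the CPU half of such a binding can be sketched with Python's os.sched_setaffinity, which is available on Linux only. The helper name pin_current_process is hypothetical. Note that this restricts only where the code runs; memory placement still follows the active memory policy (typically first-touch), so data should be allocated and initialized after pinning.

```python
import os


def pin_current_process(cpus):
    """Restrict the caller to `cpus` and return the resulting CPU set.

    Returns None on platforms that lack os.sched_setaffinity (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cpus))  # pid 0 means "the caller"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        # Pin to the lowest-numbered allowed CPU, then restore the original set.
        print(pin_current_process([min(allowed)]))
        pin_current_process(allowed)
```

In practice the CPU set passed in would be the cores of one NUMA node, taken from the system topology rather than hard-coded.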

First-Touch Allocation Policy

A default memory policy in many operating systems where a page of memory is allocated from the NUMA node of the CPU core that first writes to (touches) it.

Implication: Initializing data structures with the threads that will primarily use them is critical for optimal placement.
Pitfall: A master thread initializing all data can cause all memory to be allocated on one node, creating a hotspot.
Optimization: Parallelize initialization so each thread touches the data it will own.
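The pitfall and the fix can be illustrated with a small simulation. It models first-touch placement only in the abstract: pages are integers, nodes are integers, and the ownership rule (first half of the pages used by node 0, second half by node 1) is a made-up assumption for the example. No real memory is allocated.

```python
def first_touch_placement(num_pages, toucher_node_for_page):
    """Return {page: node}, placing each page on the node that touches it first."""
    return {page: toucher_node_for_page(page) for page in range(num_pages)}


def remote_fraction(placement, owner_node_for_page):
    """Fraction of pages that live on a different node from the thread using them."""
    remote = sum(
        1 for page, node in placement.items() if owner_node_for_page(page) != node
    )
    return remote / len(placement)


NUM_PAGES, NUM_NODES = 8, 2


def owner(page):
    # Hypothetical work split: pages 0-3 are used on node 0, pages 4-7 on node 1.
    return page * NUM_NODES // NUM_PAGES


# Pitfall: a master thread on node 0 initializes every page.
serial = first_touch_placement(NUM_PAGES, lambda page: 0)
# Optimization: each worker initializes the pages it will later use.
parallel = first_touch_placement(NUM_PAGES, owner)

print(remote_fraction(serial, owner))    # 0.5
print(remote_fraction(parallel, owner))  # 0.0
```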

Interleaving Allocation

A memory policy that distributes (stripes) pages of a large memory allocation round-robin across all available NUMA nodes.

Use Case: For applications with very large, uniformly accessed datasets where bandwidth is more critical than latency.
Effect: Averages out remote access latency and balances memory controller load.
Trade-off: Increases average latency compared to perfect local placement but prevents worst-case all-remote scenarios.
Implementation: Using numactl --interleave=all before launching an application.
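The round-robin striping can be modeled in a few lines. This is a conceptual sketch with hypothetical function names, not the kernel's implementation: it only shows why, with N nodes, a thread on any one node finds roughly 1/N of an interleaved allocation in local memory.

```python
def interleave_pages(num_pages, nodes):
    """Assign pages to nodes round-robin, as an interleave policy would."""
    return [nodes[page % len(nodes)] for page in range(num_pages)]


def local_fraction(placement, accessing_node):
    """Share of pages that are local to a thread running on `accessing_node`."""
    return placement.count(accessing_node) / len(placement)


placement = interleave_pages(8, [0, 1, 2, 3])
print(placement)                     # [0, 1, 2, 3, 0, 1, 2, 3]
print(local_fraction(placement, 0))  # 0.25
```

Every node sees the same local share, which is the balancing effect described above, at the price of most accesses being remote.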

NUMA-Aware Data Structures

Designing concurrent data structures (e.g., queues, hash maps, allocators) to partition data per NUMA node, minimizing cross-node communication.

Principle: Employ a sharding or partitioning scheme aligned with NUMA nodes.
Example: A concurrent hash map where each NUMA node manages its own stripe of buckets and associated locks. Threads on a node primarily access their local stripe.
Benefit: Dramatically reduces the frequency of expensive atomic operations and cache-line transfers across the interconnect (e.g., Intel UPI, AMD Infinity Fabric).
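The sharding principle can be sketched as a map with one shard and one lock per NUMA node. The class name NodeShardedMap is invented for this example, and Python is used only to show the structure: a real implementation would live in a systems language, allocate each shard's memory on its own node, and route requests so that threads mostly touch their local shard.

```python
import threading


class NodeShardedMap:
    """Hash map split into one independently locked shard per NUMA node."""

    def __init__(self, num_nodes):
        self._shards = [{} for _ in range(num_nodes)]
        self._locks = [threading.Lock() for _ in range(num_nodes)]

    def home_node(self, key):
        """Node whose shard owns `key` (simple modulo partitioning)."""
        return hash(key) % len(self._shards)

    def put(self, key, value):
        node = self.home_node(key)
        with self._locks[node]:  # only the owning shard's lock is taken
            self._shards[node][key] = value

    def get(self, key, default=None):
        node = self.home_node(key)
        with self._locks[node]:
            return self._shards[node].get(key, default)

    def shard_sizes(self):
        return [len(shard) for shard in self._shards]


table = NodeShardedMap(num_nodes=2)
for k in range(10):
    table.put(k, k * k)
print(table.get(3))         # 9
print(table.shard_sizes())  # [5, 5]
```

Because each shard has its own lock, threads working on different nodes never contend on the same lock or bounce the same cache lines between sockets.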

Remote Access Cost Measurement

Quantifying the performance differential between local and remote memory accesses to guide optimization efforts.

Typical Latency Ratio: Remote access can be 1.5x to 3x slower than local access, depending on the system architecture and interconnect.
Bandwidth Impact: Aggregate cross-socket bandwidth is typically lower than total local bandwidth.
Tools: Microbenchmarks (e.g., Intel MLC), performance counters monitoring UNC_CHA_TOR_INSERTS (Intel) for remote traffic, or numastat for OS-level page placement statistics.
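For the OS-level statistics, Linux also exposes per-node counters in /sys/devices/system/node/nodeN/numastat as "name value" lines. The sketch below parses that text and computes the share of allocations on a node that were originally intended for a different node. The helper names are hypothetical and the sample counters are made up for illustration.

```python
def parse_numastat(text):
    """Parse 'name value' lines from a per-node numastat file into a dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counters[fields[0]] = int(fields[1])
    return counters


def miss_ratio(counters):
    """Share of this node's allocations that were meant for another node."""
    hit = counters.get("numa_hit", 0)
    miss = counters.get("numa_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0


# Made-up sample in the sysfs format; real values come from the running system.
SAMPLE = """numa_hit 900
numa_miss 100
numa_foreign 40
interleave_hit 12
local_node 880
other_node 120
"""

print(miss_ratio(parse_numastat(SAMPLE)))  # 0.1
```

A rising miss ratio suggests that preferred nodes are running out of free memory or that placement policies are not matching where threads run.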

Application Topology Discovery

Programmatically querying the system's NUMA topology to make informed scheduling and allocation decisions at runtime.

APIs: hwloc (Portable Hardware Locality library), Linux /sys/devices/system/node/, Windows GetNumaNodeProcessorMaskEx.
Information Gathered: Number of nodes, cores per node, memory size per node, and distances (latency/affinity) between nodes.
Purpose: Enables adaptive software that configures its parallelism and memory use based on the actual hardware it's running on, essential for portable performance.
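As one way to use the distance information, the sketch below turns the contents of the per-node distance files under /sys/devices/system/node/ into a matrix and derives a nearest-first fallback order for allocations. The function names and the four-node sample are assumptions for illustration. The code also assumes that the columns of each distance line follow the same ascending node order as the keys, and it relies on the convention that a node's distance to itself is 10.

```python
def parse_distances(rows):
    """Build {node: {other: distance}} from {node: 'distance' file contents}."""
    nodes = sorted(rows)
    matrix = {}
    for node in nodes:
        values = [int(v) for v in rows[node].split()]
        if len(values) != len(nodes):
            raise ValueError(f"node {node}: expected {len(nodes)} distances")
        matrix[node] = dict(zip(nodes, values))
    return matrix


def fallback_order(matrix, node):
    """Nodes sorted from nearest to farthest as seen from `node` (itself first)."""
    return sorted(matrix[node], key=lambda other: (matrix[node][other], other))


# Hypothetical 4-node machine: nodes 0/1 and 2/3 are close pairs.
SAMPLE_ROWS = {
    0: "10 16 32 32\n",
    1: "16 10 32 32\n",
    2: "32 32 10 16\n",
    3: "32 32 16 10\n",
}

matrix = parse_distances(SAMPLE_ROWS)
print(fallback_order(matrix, 2))  # [2, 3, 0, 1]
```

An allocator could walk this order when the preferred node is full, so spilled data lands on the closest neighbor instead of an arbitrary node.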

NUMA

Frequently Asked Questions

Non-Uniform Memory Access (NUMA) is a critical memory architecture for modern multi-socket servers and high-performance computing systems. It directly impacts application performance by defining how processors access memory. These FAQs address its core mechanisms, performance implications, and optimization strategies.

Non-Uniform Memory Access (NUMA) is a shared-memory multiprocessor architecture where memory access time depends on the memory location relative to the requesting processor. In a NUMA system, each processor or group of processors (a NUMA node) has its own local memory bank. Access to this local memory is fast. Access to memory attached to another processor (remote memory) is slower due to the need to traverse an interconnect (e.g., Intel's Ultra Path Interconnect (UPI) or AMD's Infinity Fabric). The operating system and applications must be aware of this hierarchy to optimize performance by minimizing remote memory accesses.

How it works:

The system is divided into NUMA nodes, each containing CPUs, a memory controller, and local RAM.
The OS attempts to allocate memory from the node where the requesting thread is executing (first-touch policy), and its NUMA-aware scheduler tries to keep threads close to their memory.
A System Locality Information Table (SLIT) in ACPI describes the relative access latencies between all nodes.

PARALLEL COMPUTING ARCHITECTURE

Related Terms

NUMA is a foundational concept in parallel computing, particularly relevant for modern multi-socket servers and high-performance computing clusters. Understanding its related architectural and scheduling concepts is crucial for optimizing workloads on NPUs and other accelerators.

SMP (Symmetric Multiprocessing)

Symmetric Multiprocessing is a multiprocessor architecture where all processors share a single, uniform memory space and have equal access to all I/O devices. This contrasts with NUMA, where memory access is non-uniform.

Uniform Memory Access (UMA): All memory accesses have the same latency, regardless of which processor makes the request.
Centralized Memory Controller: A single memory bus or interconnect serves all processors, which can become a bottleneck.
Use Case: Found in simpler, smaller-scale multi-core systems where latency uniformity simplifies programming but limits scalability.

ccNUMA (Cache-Coherent NUMA)

Cache-Coherent Non-Uniform Memory Access is the predominant form of NUMA in modern x86 and ARM servers. It extends the basic NUMA model by maintaining a single, coherent view of memory across all processor caches.

Hardware Coherence Protocol: Uses a directory-based or snooping protocol (e.g., MESI) to ensure all CPUs see the most recent value of a memory location.
Transparency: The coherence is handled entirely in hardware, making the system appear as a shared-memory machine to software, despite physical non-uniformity.
Performance Impact: Remote cache hits are faster than accessing remote main memory, but coherence traffic adds overhead to cross-socket communication.

NUMA Node

A NUMA Node is the fundamental building block of a NUMA system, consisting of one or more processors and their directly attached, local memory. The system is a collection of interconnected NUMA nodes.

Local Memory: Memory physically attached to the node's memory controllers. Access is fast (low latency, high bandwidth).
Remote Memory: Memory attached to a different NUMA node. Access incurs higher latency and potentially lower bandwidth across an interconnect (e.g., Intel UPI, AMD Infinity Fabric).
Affinity: A core concept where a process or thread is scheduled on a core within the node containing the data it uses most frequently to minimize remote accesses.

First-Touch Policy

The First-Touch Policy is a common operating system and runtime memory allocation strategy in NUMA systems. Memory pages are allocated on the NUMA node where the thread that first writes to (touches) them is currently executing.

Automatic Placement: Simplifies programming by automatically attempting to co-locate data with the using thread.
Potential Pitfall: If initialization is performed by a single thread (e.g., a main thread), all memory may be allocated on one node, leading to severe imbalance and remote access for worker threads on other nodes.
Mitigation: Requires careful parallelization of data initialization or explicit memory placement (e.g., the numactl command or the mbind() system call on Linux).

Interconnect (e.g., UPI, Infinity Fabric)

The NUMA Interconnect is the high-speed link that connects NUMA nodes, enabling processors to access remote memory and maintain cache coherence. Its bandwidth and latency are critical to overall system performance.

Intel Ultra Path Interconnect (UPI): The point-to-point interconnect used in Intel Xeon Scalable processors for multi-socket communication.
AMD Infinity Fabric: AMD's scalable, coherent interconnect used within and between EPYC processor dies and sockets.
Performance Characteristic: Bandwidth for remote access is typically a fraction of local memory bandwidth, and latency can be 1.5x to 3x higher. Optimizing algorithms to minimize cross-interconnect traffic is a key NUMA tuning goal.

Memory Affinity / NUMA Pinning

Memory Affinity (or NUMA Pinning) is the explicit control of thread-to-core and memory-to-node binding to optimize performance on a NUMA system, overriding the OS's default scheduler and memory allocator.

Thread Pinning: Binding a specific thread or process to a set of cores on a particular NUMA node using taskset or numactl --cpunodebind.
Memory Pinning: Directing memory allocations to a specific NUMA node using numactl --membind or APIs like libnuma.
Goal: Ensures a thread runs on cores local to the memory it uses, avoiding unpredictable remote access penalties. Essential for achieving deterministic, high performance in latency-sensitive and HPC workloads.
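When launching a program from a script, the numactl options named above can be assembled into a command line. The helper below is a hypothetical convenience wrapper, and ./db_server is a placeholder program name. It only builds the argument list and leaves execution (for example via subprocess.run) to the caller, since numactl may not be installed everywhere.

```python
def numactl_command(program_args, cpu_node=None, mem_node=None, interleave=None):
    """Build an argv list that runs `program_args` under numactl."""
    if mem_node is not None and interleave is not None:
        raise ValueError("choose either mem_node or interleave, not both")
    argv = ["numactl"]
    if cpu_node is not None:
        argv.append(f"--cpunodebind={cpu_node}")  # run only on this node's cores
    if mem_node is not None:
        argv.append(f"--membind={mem_node}")      # allocate only from this node
    if interleave is not None:
        argv.append(f"--interleave={interleave}")  # stripe pages across nodes
    return argv + list(program_args)


# Placeholder program: bind both CPUs and memory to node 0.
print(numactl_command(["./db_server", "--port", "5432"], cpu_node=0, mem_node=0))
```

Binding CPU and memory to the same node is the common case; the interleave option is the alternative for large, uniformly accessed data.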

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
