NUMA (Non-Uniform Memory Access)

NUMA is a computer memory design for multiprocessors where memory access time depends on the memory location relative to the processor, with local memory being faster than non-local (remote) memory.
PARALLEL COMPUTING ARCHITECTURE

What is NUMA (Non-Uniform Memory Access)?

NUMA is a foundational computer architecture for modern multi-socket servers and high-performance computing systems, directly impacting how parallel workloads, including those for AI and machine learning, are scheduled and optimized.

NUMA (Non-Uniform Memory Access) is a shared-memory multiprocessor architecture where memory access latency depends on the memory location relative to the requesting processor. In a NUMA system, each processor or group of processors (a NUMA node) has its own local memory, which it can access with low latency. Accessing memory attached to a remote processor (remote memory) incurs higher latency and usually lower bandwidth, because the request must cross the inter-node interconnect. This design contrasts with Uniform Memory Access (UMA) architectures, where all processors share a single, centralized memory pool with equal access times.

For AI and high-performance computing, effective NUMA-aware scheduling is critical. The operating system and runtime environments use NUMA affinity policies to bind processes and their memory allocations to specific nodes, minimizing costly remote accesses. Performance degrades severely if a thread running on one node frequently accesses data allocated on another, a condition sometimes called NUMA thrashing. Parallel runtimes and frameworks often need to manage data placement and thread pinning explicitly to achieve good throughput on NUMA systems, especially for the memory-bound workloads common in large-scale model training and inference.

PARALLEL COMPUTING

Key Characteristics of NUMA Architecture

NUMA (Non-Uniform Memory Access) is a multiprocessor memory architecture where access time depends on the memory location relative to the processor. This design is fundamental for scaling modern multi-socket servers and high-core-count systems.

01

Memory Access Latency Hierarchy

The defining characteristic of NUMA is a memory latency hierarchy. Each processor or group of processors (a NUMA node) has its own local memory. Access to this local memory is fast. Access to memory attached to another processor (remote memory) is slower, traversing an interconnect like Intel's Ultra Path Interconnect (UPI) or AMD's Infinity Fabric. This creates a non-uniform access pattern where performance depends on data locality.

02

NUMA Node Structure

A NUMA node is the fundamental building block, typically comprising:

  • One or more CPU cores (a socket or a core complex).
  • A local bank of DRAM (the node's memory).
  • A local memory controller.
  • Links to other NUMA nodes via the interconnect.

The operating system enumerates these nodes, and software can query topology via interfaces like libnuma on Linux. Efficient programming requires thread and memory affinity to keep processes and their data on the same node.
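As a rough illustration of how software can inspect this node structure, the sketch below reads the per-node cpulist files that Linux exposes under /sys/devices/system/node/ instead of calling libnuma. The helper names parse_cpulist and numa_nodes are hypothetical, and the code assumes a Linux system; where that sysfs directory is missing it simply returns an empty mapping.

```python
from __future__ import annotations

from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist string such as "0-3,8-11" into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means the node has no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node id to its CPU ids (empty dict if sysfs is absent)."""
    nodes: dict[int, list[int]] = {}
    for node_dir in sorted(Path(root).glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            # Directory names look like "node0", "node1", ...
            nodes[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node, cpus in numa_nodes().items():
        print(f"node {node}: CPUs {cpus}")
```

On a single-socket machine this usually reports one node that holds every CPU, which is a quick way to confirm whether NUMA placement matters on a given host.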
03

Interconnect and Coherency Protocol

NUMA nodes communicate via a high-speed interconnect. This hardware also runs a cache coherency protocol (e.g., a variant of MESI or MOESI) across the entire system. The protocol ensures that all processor caches have a consistent view of shared memory, even when data is migrated or replicated. The overhead of maintaining coherency across the interconnect for remote accesses is a primary source of the NUMA penalty. Implementations track sharing state either by snooping (broadcasting requests to all caches) or with a directory that records which caches hold each line.

04

First-Touch Policy and Memory Allocation

Most operating systems use a first-touch policy for memory allocation by default. When a process first writes to a page of memory, that page is physically allocated from the DRAM of the NUMA node where the writing thread is executing. If the initializing thread is not carefully placed, the data can end up on a node that is remote from the threads that later use it. Explicit memory policy interfaces (e.g., the mbind() and set_mempolicy() system calls, or the numactl command-line tool) allow developers to control allocation, such as interleaving pages across nodes to spread bandwidth demand evenly.

05

Performance Impact and Optimization

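Poor NUMA awareness can slow memory-bound code substantially: remote accesses are typically 1.5x to 3x slower than local ones, and a workload dominated by remote traffic can see a comparable slowdown. Key optimization strategies include: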

  • Thread Pinning: Binding threads to specific cores/nodes.
  • Data Placement: Allocating data on the node where it will be processed.
  • NUMA-Aware Algorithms: Designing parallel algorithms to minimize cross-node communication.
  • Topology Awareness: Using system tools (numactl, lstopo) to understand the hardware layout.

The goal is to maximize accesses to local memory.
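To make the cost of remote accesses concrete, the following back-of-the-envelope model treats average memory latency as a weighted mix of local and remote latency. It is a simplification that ignores caches, bandwidth limits, and contention; the function name average_latency_ns and the 100 ns and 250 ns figures are illustrative assumptions, not measurements.

```python
def average_latency_ns(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Weighted average latency for a given share of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    LOCAL_NS, REMOTE_NS = 100.0, 250.0  # assumed, illustrative latencies
    for fraction in (1.0, 0.9, 0.5, 0.0):
        latency = average_latency_ns(fraction, LOCAL_NS, REMOTE_NS)
        print(f"{fraction:.0%} local -> {latency:.0f} ns average "
              f"({latency / LOCAL_NS:.2f}x the local latency)")
```

Under these assumed numbers, every ten percentage points of accesses that move from local to remote adds about 15 ns to the average, which is why even partial locality improvements tend to pay off.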
06

CC-NUMA vs. NCC-NUMA

Most modern systems are Cache-Coherent NUMA (CC-NUMA), where hardware maintains coherence across all processor caches, presenting a single shared memory image to software. Historically, Non-Cache-Coherent NUMA (NCC-NUMA) systems existed, requiring explicit software management for coherence, similar to clusters. CC-NUMA is the standard for commercial servers, workstations, and many high-performance computing systems, providing an easier programming model at the cost of more complex interconnect hardware.

PARALLELISM AND SCHEDULING

How NUMA Architecture Works

NUMA (Non-Uniform Memory Access) is a shared-memory multiprocessor architecture where memory access latency depends on the physical location of the memory relative to the requesting processor.

In a NUMA system, each processor or group of processors (a NUMA node) has its own local memory bank. Access to this local memory is fast. Access to memory attached to a remote node (remote memory) is slower due to the interconnects traversed. This contrasts with Uniform Memory Access (UMA) architectures, where all memory is equidistant. The primary goal is to scale memory bandwidth with processor count, avoiding the bottleneck of a single, shared memory bus.

Efficient programming requires NUMA-aware scheduling and data placement. The operating system and applications should allocate memory on the node where the computation occurs to minimize remote accesses. This is managed via the default first-touch policy and explicit controls such as the numactl tool or the libnuma API. Performance degrades significantly if threads frequently access remote memory, a condition sometimes called NUMA thrashing. This architecture is foundational in modern multi-socket servers and high-core-count CPUs.

MEMORY ARCHITECTURE

NUMA vs. UMA: Architectural Comparison

A comparison of Non-Uniform Memory Access (NUMA) and Uniform Memory Access (UMA) architectures, detailing their fundamental design principles, performance characteristics, and suitability for different parallel computing workloads.

| Architectural Feature | NUMA (Non-Uniform Memory Access) | UMA (Uniform Memory Access) |
| --- | --- | --- |
| Core Design Principle | Distributed, shared memory organized into nodes. Each processor has local memory; remote memory access is possible but slower. | Centralized, symmetric shared memory. All processors access a single, unified memory pool via a common system bus or crossbar. |
| Memory Access Latency | Non-uniform. Local memory access is fast (on the order of 100 ns). Remote memory access incurs higher latency due to interconnect traversal (on the order of 200-300 ns). | Uniform. All memory accesses have roughly identical latency, determined by the shared bus/interconnect speed (on the order of 150 ns). |
| Scalability | High. Scales efficiently to many processors (dozens to hundreds) by adding nodes, minimizing bus contention. | Low. Limited by bus/interconnect bandwidth. Performance degrades significantly beyond ~4-8 processors due to contention. |
| Typical Interconnect | High-speed point-to-point links forming a network (e.g., HyperTransport and QuickPath Interconnect, and their successors Infinity Fabric and UPI). | Shared system bus (e.g., Front-Side Bus) or a crossbar switch. |
| Cache Coherence Protocol | Directory-based or snooping protocols adapted for distributed memory (e.g., MOESI). More complex due to non-uniform latency. | Snooping protocol (e.g., MESI) over the shared bus. Simpler but generates more broadcast traffic. |
| System Cost & Complexity | Higher. Requires sophisticated memory controllers per node and complex coherence logic. Higher design and validation cost. | Lower. Simpler, centralized memory controller and coherence logic. Lower design cost. |
| Optimal Workload Type | Workloads with strong data locality, where threads can be scheduled on the node holding their required data (e.g., large-scale simulations, databases). | Small-scale symmetric multiprocessing (SMP) systems running general-purpose workloads or workloads with poor data locality. |
| OS & Software Awareness | Requires a NUMA-aware OS (for memory allocation policies) and applications (for thread/data placement) to avoid performance pitfalls. | Transparent to software. No special placement policies required; the OS treats all memory as equal. |
| Example Architectures | AMD EPYC, multi-socket Intel Xeon (Nehalem and later, including Xeon Scalable), ARM-based server SoCs, modern multi-socket servers. | Traditional single-bus SMP systems, older multi-core CPUs (e.g., Intel Core 2 Quad), some embedded multicore processors. |

PARALLELISM AND SCHEDULING

NUMA Optimization Techniques

Techniques to mitigate the performance penalty of Non-Uniform Memory Access (NUMA) in multi-socket servers by strategically placing data and threads close to the processors that need them.

01

Thread and Memory Affinity

The practice of pinning software threads to specific CPU cores and allocating memory from the local NUMA node of those cores. This ensures memory accesses are local, minimizing latency.

  • Tools: numactl (Linux), SetThreadAffinityMask (Windows).
  • Goal: Prevent the OS scheduler from migrating threads between sockets, which would turn accesses to memory the thread has already allocated into remote accesses.
  • Example: Binding a database process to cores on NUMA node 0 and allocating its memory from node 0.
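A minimal sketch of CPU pinning from inside a program, assuming Linux and Python's standard os.sched_setaffinity call. The helpers node_cpus and pin_to_node are hypothetical names. The Python standard library has no call for setting a memory policy, so this sketch relies on the default first-touch behavior to place newly touched pages on the pinned node; numactl or libnuma would be needed for explicit memory binding.

```python
from __future__ import annotations

import os
from pathlib import Path


def node_cpus(node: int) -> set[int]:
    """CPU ids of a NUMA node from Linux sysfs (empty set if unavailable)."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    if not path.is_file():
        return set()
    cpus: set[int] = set()
    for part in path.read_text().strip().split(","):
        if part:
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def pin_to_node(node: int) -> set[int]:
    """Restrict the caller to the CPUs of `node` and return the new CPU set."""
    allowed = os.sched_getaffinity(0)  # 0 means "the caller"
    # Assumption: if the node is unknown or disjoint from the allowed CPUs,
    # keep the current set rather than fail.
    target = (node_cpus(node) & allowed) or allowed
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_setaffinity"):
        print("now restricted to CPUs:", sorted(pin_to_node(0)))
    else:
        print("CPU affinity calls are not available on this platform")
```

Pinning before allocating and initializing data matters here: pages touched after the call are the ones that first-touch will place on the chosen node.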
02

First-Touch Allocation Policy

A default memory policy in many operating systems where a page of memory is allocated from the NUMA node of the CPU core that first writes to (touches) it.

  • Implication: Initializing data structures with the threads that will primarily use them is critical for optimal placement.
  • Pitfall: A master thread initializing all data can cause all memory to be allocated on one node, creating a hotspot.
  • Optimization: Parallelize initialization so each thread touches the data it will own.
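The pitfall and the optimization above can be compared with a toy model of first-touch placement. This is a simulation only, not a measurement of real page placement; the thread-to-node mapping, the page counts, and the function names are all illustrative assumptions.

```python
from __future__ import annotations


def first_touch_placement(first_toucher: list[int], thread_node: dict[int, int]) -> list[int]:
    """Place each page on the node of the thread that touched it first."""
    return [thread_node[thread] for thread in first_toucher]


def local_fraction(placement: list[int], page_user: list[int], thread_node: dict[int, int]) -> float:
    """Share of pages that live on the same node as the thread that uses them."""
    local = sum(
        1 for page, user in enumerate(page_user)
        if placement[page] == thread_node[user]
    )
    return local / len(page_user)


# Assumed setup: 4 threads, two per node; thread t processes pages 2t and 2t+1.
THREAD_NODE = {0: 0, 1: 0, 2: 1, 3: 1}
PAGE_USER = [0, 0, 1, 1, 2, 2, 3, 3]

if __name__ == "__main__":
    # Pitfall: thread 0 (the "master") initializes every page.
    master = first_touch_placement([0] * len(PAGE_USER), THREAD_NODE)
    # Optimization: each thread initializes the pages it will later use.
    parallel = first_touch_placement(PAGE_USER, THREAD_NODE)
    print("master init   :", local_fraction(master, PAGE_USER, THREAD_NODE))
    print("parallel init :", local_fraction(parallel, PAGE_USER, THREAD_NODE))
```

In this model the master-thread initialization leaves the two threads on node 1 with no local pages at all, while parallel initialization makes every page local to its user.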
03

Interleaving Allocation

A memory policy that distributes (stripes) pages of a large memory allocation round-robin across all available NUMA nodes.

  • Use Case: For applications with very large, uniformly accessed datasets where bandwidth is more critical than latency.
  • Effect: Averages out remote access latency and balances memory controller load.
  • Trade-off: Increases average latency compared to perfect local placement but prevents worst-case all-remote scenarios.
  • Implementation: Using numactl --interleave=all before launching an application.
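The round-robin striping described above can be modeled in a few lines. This is a conceptual sketch of the policy's effect, not how the kernel implements it; interleave_pages and local_share are hypothetical names.

```python
from __future__ import annotations

from collections import Counter


def interleave_pages(num_pages: int, nodes: list[int]) -> list[int]:
    """Assign page i to nodes[i % len(nodes)], i.e. round-robin striping."""
    if not nodes:
        raise ValueError("at least one node is required")
    return [nodes[i % len(nodes)] for i in range(num_pages)]


def local_share(placement: list[int], node: int) -> float:
    """Fraction of pages that are local to a thread running on `node`."""
    return placement.count(node) / len(placement) if placement else 0.0


if __name__ == "__main__":
    placement = interleave_pages(10, [0, 1, 2, 3])
    print("pages per node:", dict(Counter(placement)))
    print("share local to node 0:", local_share(placement, 0))
```

With N nodes, a thread on any one node finds roughly 1/N of the pages local, which is the trade-off noted above: memory controller load is balanced, but most accesses are remote.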
04

NUMA-Aware Data Structures

Designing concurrent data structures (e.g., queues, hash maps, allocators) to partition data per NUMA node, minimizing cross-node communication.

  • Principle: Employ a sharding or partitioning scheme aligned with NUMA nodes.
  • Example: A concurrent hash map where each NUMA node manages its own stripe of buckets and associated locks. Threads on a node primarily access their local stripe.
  • Benefit: Dramatically reduces the frequency of expensive atomic operations and cache-line transfers across the interconnect (e.g., Intel UPI, AMD Infinity Fabric).
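The partitioning principle can be sketched structurally as a map with one shard and one lock per node. The class NodeShardedMap is a hypothetical illustration: Python cannot place each shard's memory on a particular node, so this shows only the routing and locking layout. A real implementation would also allocate each shard from its node's memory and run the threads that serve it on that node.

```python
from __future__ import annotations

import threading
from typing import Any, Hashable


class NodeShardedMap:
    """Hash map split into one shard (dict plus lock) per NUMA node."""

    def __init__(self, num_nodes: int) -> None:
        if num_nodes < 1:
            raise ValueError("num_nodes must be at least 1")
        self._shards: list[dict] = [{} for _ in range(num_nodes)]
        self._locks = [threading.Lock() for _ in range(num_nodes)]

    def home_node(self, key: Hashable) -> int:
        """Node whose shard owns `key`; work on `key` should be routed there."""
        return hash(key) % len(self._shards)

    def put(self, key: Hashable, value: Any) -> None:
        node = self.home_node(key)
        with self._locks[node]:  # only this shard's lock is taken
            self._shards[node][key] = value

    def get(self, key: Hashable, default: Any = None) -> Any:
        node = self.home_node(key)
        with self._locks[node]:
            return self._shards[node].get(key, default)

    def local_items(self, node: int) -> dict:
        """Snapshot of the entries owned by `node`."""
        with self._locks[node]:
            return dict(self._shards[node])


if __name__ == "__main__":
    table = NodeShardedMap(num_nodes=2)
    for key in range(6):
        table.put(key, key * key)
    print("node 0 owns:", table.local_items(0))
    print("node 1 owns:", table.local_items(1))
```

Because each key maps to exactly one shard, threads working on different nodes' keys never contend for the same lock or the same dictionary.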
05

Remote Access Cost Measurement

Quantifying the performance differential between local and remote memory accesses to guide optimization efforts.

  • Typical Latency Ratio: Remote access can be 1.5x to 3x slower than local access, depending on the system architecture and interconnect.
  • Bandwidth Impact: Aggregate cross-socket bandwidth is typically lower than total local bandwidth.
  • Tools: Microbenchmarks (e.g., Intel MLC), performance counters monitoring UNC_CHA_TOR_INSERTS (Intel) for remote traffic, or numastat for OS-level page placement statistics.
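For the OS-level statistics, one possible starting point is the per-node numastat file that Linux exposes at /sys/devices/system/node/nodeN/numastat. The sketch below parses its name/value lines and derives the share of allocations on a node that came from tasks running on that node. These counters describe page allocations, not individual memory accesses, and the function names and sample values are illustrative assumptions.

```python
from __future__ import annotations

from pathlib import Path


def parse_numastat(text: str) -> dict[str, int]:
    """Parse "name value" lines such as "numa_hit 12345" into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def local_allocation_ratio(stats: dict[str, int]) -> float:
    """Share of this node's allocations made by tasks running on the node."""
    local = stats.get("local_node", 0)
    other = stats.get("other_node", 0)
    total = local + other
    return local / total if total else 1.0


def read_node_numastat(node: int, root: str = "/sys/devices/system/node") -> dict[str, int]:
    """Read one node's counters (empty dict if the file does not exist)."""
    path = Path(root) / f"node{node}" / "numastat"
    return parse_numastat(path.read_text()) if path.is_file() else {}


if __name__ == "__main__":
    stats = read_node_numastat(0)
    if stats:
        print(f"node 0 local allocation ratio: {local_allocation_ratio(stats):.3f}")
    else:
        print("no numastat file found for node 0")
```

A ratio that drifts well below 1.0 on a busy node suggests that tasks on other nodes are allocating from it, which is a cue to review pinning and initialization order.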
06

Application Topology Discovery

Programmatically querying the system's NUMA topology to make informed scheduling and allocation decisions at runtime.

  • APIs: hwloc (Portable Hardware Locality library), Linux /sys/devices/system/node/, Windows GetNumaNodeProcessorMaskEx.
  • Information Gathered: Number of nodes, cores per node, memory size per node, and distances (latency/affinity) between nodes.
  • Purpose: Enables adaptive software that configures its parallelism and memory use based on the actual hardware it's running on, essential for portable performance.
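As one way to gather the inter-node distances listed above without hwloc, the sketch below reads the distance file under each Linux /sys/devices/system/node/nodeN/ directory, which holds one relative distance per node. The helpers node_distances and nearest_remote are hypothetical, and the code assumes node ids are contiguous from 0 so that a position in a distance row equals a node id.

```python
from __future__ import annotations

from pathlib import Path


def node_distances(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each node id to its row of relative distances to every node."""
    distances: dict[int, list[int]] = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        distance_file = node_dir / "distance"
        if distance_file.is_file():
            row = [int(value) for value in distance_file.read_text().split()]
            distances[int(node_dir.name[4:])] = row
    return distances


def nearest_remote(distances: dict[int, list[int]], node: int) -> int | None:
    """Closest node other than `node`, or None on a single-node system."""
    # Assumption: index i in the row refers to node i (contiguous node ids).
    candidates = [
        (distance, other)
        for other, distance in enumerate(distances[node])
        if other != node
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    table = node_distances()
    for node in sorted(table):
        print(f"node {node}: distances {table[node]}, "
              f"nearest remote node: {nearest_remote(table, node)}")
```

A scheduler or allocator could use nearest_remote as a fallback target when the preferred node is out of memory, so that spillover lands on the cheapest remote node rather than an arbitrary one.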
NUMA

Frequently Asked Questions

Non-Uniform Memory Access (NUMA) is a critical memory architecture for modern multi-socket servers and high-performance computing systems. It directly impacts application performance by defining how processors access memory. These FAQs address its core mechanisms, performance implications, and optimization strategies.

Non-Uniform Memory Access (NUMA) is a shared-memory multiprocessor architecture where memory access time depends on the memory location relative to the requesting processor. In a NUMA system, each processor or group of processors (a NUMA node) has its own local memory bank. Access to this local memory is fast. Access to memory attached to another processor (remote memory) is slower due to the need to traverse an interconnect (e.g., Intel's Ultra Path Interconnect (UPI) or AMD's Infinity Fabric). The operating system and applications must be aware of this hierarchy to optimize performance by minimizing remote memory accesses.

How it works:

  • The system is divided into NUMA nodes, each containing CPUs, a memory controller, and local RAM.
  • The OS tries to schedule threads near their memory and, under the default first-touch policy, allocates each page from the node where the thread that first touches it is running.
  • A System Locality Information Table (SLIT) in ACPI describes the relative distances between all nodes, which the OS uses as a proxy for access latency.
Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.