The Zero Redundancy Optimizer (ZeRO) is a suite of memory optimization techniques for distributed deep learning that partitions the model state (optimizer states, gradients, and parameters) across the data-parallel workers. This eliminates the memory redundancy inherent in traditional data parallelism, where every GPU holds a full replica of the entire model state. By sharding these components in three cumulative stages, ZeRO enables the training of models whose state is orders of magnitude larger than the memory of any single accelerator.
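The memory accounting behind this claim can be sketched numerically. The ZeRO paper models per-GPU model-state memory for mixed-precision Adam training as 2Ψ bytes of fp16 parameters, 2Ψ bytes of fp16 gradients, and 12Ψ bytes of optimizer state (fp32 master parameters, momentum, and variance) for a model with Ψ parameters; each ZeRO stage shards one more of these components across the N data-parallel GPUs. The helper below is an illustrative sketch of that arithmetic (the function name and stage numbering convention are ours), not part of any library API:

```python
def zero_memory_per_gpu(psi, n_gpus, stage):
    """Approximate per-GPU model-state memory in bytes for
    mixed-precision Adam, following the ZeRO paper's accounting:
      fp16 parameters:   2 * psi bytes
      fp16 gradients:    2 * psi bytes
      optimizer states: 12 * psi bytes (fp32 params, momentum, variance)
    `stage` selects which of these are sharded across n_gpus."""
    params, grads, opt = 2 * psi, 2 * psi, 12 * psi
    if stage == 0:   # plain data parallelism: full replica on every GPU
        return params + grads + opt
    if stage == 1:   # ZeRO-1: shard optimizer states only
        return params + grads + opt / n_gpus
    if stage == 2:   # ZeRO-2: also shard gradients
        return params + (grads + opt) / n_gpus
    if stage == 3:   # ZeRO-3: also shard the parameters themselves
        return (params + grads + opt) / n_gpus
    raise ValueError("stage must be in 0..3")

# A 7.5B-parameter model on 64 GPUs (the paper's running example):
psi, n = 7.5e9, 64
for s in range(4):
    print(f"stage {s}: {zero_memory_per_gpu(psi, n, s) / 1e9:.1f} GB")
```

Running the sketch reproduces the familiar progression: roughly 120 GB per GPU with no sharding, shrinking to about 31.4 GB, 16.6 GB, and finally 1.9 GB at stage 3, where per-GPU memory scales inversely with the number of devices.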
