Inferensys

Glossary

ZeRO (Zero Redundancy Optimizer)

ZeRO is a memory optimization technique for distributed training that partitions model states (optimizer states, gradients, parameters) across data-parallel processes to eliminate memory redundancy.
PARAMETER-EFFICIENT FINE-TUNING

What is ZeRO (Zero Redundancy Optimizer)?

ZeRO is a foundational memory optimization framework for distributed training that enables the training of models with trillions of parameters by eliminating redundant storage across data-parallel processes.

ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for distributed data-parallel training that partitions the three primary model states—optimizer states, gradients, and model parameters—across GPUs instead of replicating them. This systematic elimination of memory redundancy allows the aggregate GPU memory of a cluster to be used as a single, large pool, enabling the training of models far larger than what could fit on any single device. The technique is implemented through progressive stages, ZeRO-1, ZeRO-2, and ZeRO-3, each sharding more state components to achieve greater memory savings at the cost of increased communication overhead.

The core innovation is its sharding strategy. In standard data parallelism, every GPU holds a full copy of the model parameters, gradients, and optimizer states. ZeRO partitions these states: ZeRO-1 shards only the optimizer states, ZeRO-2 adds gradient partitioning, and ZeRO-3 shards the parameters as well. PyTorch's Fully Sharded Data Parallel (FSDP) implements the same idea as ZeRO-3. During the forward and backward passes, parameters are gathered on demand from across devices and then re-sharded, with communication overlapped with computation where possible to limit the added latency. This makes ZeRO well suited to training massive models, such as large language models and Mixture-of-Experts (MoE) models, where memory rather than compute is the primary constraint.

MEMORY OPTIMIZATION

Key Features of ZeRO

ZeRO (Zero Redundancy Optimizer) is a suite of memory optimization stages for distributed training that partitions model states across data-parallel processes to eliminate memory redundancy.

01

ZeRO Stage 1: Optimizer State Partitioning

Optimizer states (e.g., momentum and variance for Adam) are partitioned across data-parallel processes. Each process stores and updates only its assigned shard, reducing model-state memory by up to ~4x for mixed-precision Adam training.

  • Mechanism: The optimizer is sharded. After gradients are reduced, each process updates only the parameters covered by its own optimizer-state shard, and the updated parameters are then all-gathered so every process again holds the full, current model.
  • Impact: Enables training of models up to ~4x larger than standard data parallelism, as optimizer states are a major memory consumer for adaptive optimizers like Adam.
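The Stage 1 update can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: plain Python lists stand in for GPU tensors, SGD with momentum replaces Adam to keep the arithmetic short, the gradients are assumed to be already averaged across ranks, and the names `shard_bounds` and `zero1_step` are hypothetical rather than part of any library.

```python
def shard_bounds(num_params, world_size, rank):
    """Return the [start, end) slice of the flat parameter list owned by `rank`."""
    per_rank = (num_params + world_size - 1) // world_size  # ceiling division
    start = min(rank * per_rank, num_params)
    end = min(start + per_rank, num_params)
    return start, end


def zero1_step(params, grads, momentum_shards, lr=0.1, beta=0.9):
    """Simulate one ZeRO-1 style step.

    params          -- full parameter list (replicated on every rank)
    grads           -- full, already-averaged gradient list
    momentum_shards -- one list per rank, holding momentum only for that rank's slice
    """
    world_size = len(momentum_shards)
    updated_slices = []
    for rank in range(world_size):
        start, end = shard_bounds(len(params), world_size, rank)
        local_update = []
        for i in range(start, end):
            # Each rank touches only the optimizer state it owns.
            m = beta * momentum_shards[rank][i - start] + grads[i]
            momentum_shards[rank][i - start] = m
            local_update.append(params[i] - lr * m)
        updated_slices.append(local_update)
    # Stand-in for the all-gather that rebuilds the full parameter list on every rank.
    return [p for piece in updated_slices for p in piece]


params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]
world_size = 2
momentum_shards = []
for rank in range(world_size):
    start, end = shard_bounds(len(params), world_size, rank)
    momentum_shards.append([0.0] * (end - start))

new_params = zero1_step(params, grads, momentum_shards)
```

Each simulated rank holds momentum for two of the four parameters, so optimizer-state memory per rank is halved while the resulting parameters match an unsharded update.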
02

ZeRO Stage 2: Gradient Partitioning

Gradients are partitioned across processes in addition to optimizer states. After the backward pass, each process only retains the gradients for its assigned parameter partition.

  • Mechanism: Uses a reduce-scatter operation instead of an all-reduce. Each GPU becomes responsible for a unique subset of gradients.
  • Impact: Provides up to an additional ~2x memory savings over Stage 1. Total model-state memory reduction is up to ~8x compared to standard data parallelism, enabling even larger models.
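The difference between the two collectives can be shown with a toy simulation. The sketch below assumes Python lists in place of GPU tensors and a gradient length divisible by the number of ranks; the `all_reduce` and `reduce_scatter` functions here are simplified stand-ins written for illustration, not calls into a communication library.

```python
def all_reduce(per_rank_grads):
    """Every rank ends up with the full summed gradient (standard data parallelism)."""
    total = [sum(values) for values in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]


def reduce_scatter(per_rank_grads):
    """Each rank ends up with only its own slice of the summed gradient (ZeRO-2 style)."""
    world_size = len(per_rank_grads)
    num_elements = len(per_rank_grads[0])
    if num_elements % world_size != 0:
        raise ValueError("this sketch assumes the gradient splits evenly across ranks")
    chunk = num_elements // world_size
    total = [sum(values) for values in zip(*per_rank_grads)]
    return [total[rank * chunk:(rank + 1) * chunk] for rank in range(world_size)]


# Local gradients computed by two ranks on different mini-batches.
local_grads = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]

replicated = all_reduce(local_grads)   # each rank stores 4 reduced values
sharded = reduce_scatter(local_grads)  # each rank stores 2 reduced values
```

With two ranks, the reduce-scatter result leaves each rank holding half of the reduced gradient, which is where the Stage 2 savings come from.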
03

ZeRO Stage 3: Parameter Partitioning

The model parameters themselves are partitioned across GPUs. Each process only stores the parameters for its assigned partition, fetching others on-demand during forward and backward passes.

  • Mechanism: Parameters are gathered just-in-time for computation and discarded afterward. This introduces communication overhead but maximizes memory savings.
  • Impact: Enables memory reduction that scales linearly with the degree of data parallelism (N). In theory, model states of total size M can be spread across N GPUs at ~M/N per GPU, with activations and temporary buffers needing additional memory on top.
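The gather-compute-discard cycle can be sketched in a few lines. This is an illustrative single-process simulation, assuming each "layer" is just an elementwise scaling by a weight vector and that Python lists stand in for sharded tensors; `shard`, `all_gather`, and `forward` are hypothetical helper names.

```python
def shard(weights, world_size):
    """Split one layer's weights into contiguous per-rank pieces."""
    chunk = (len(weights) + world_size - 1) // world_size  # ceiling division
    return [weights[rank * chunk:(rank + 1) * chunk] for rank in range(world_size)]


def all_gather(shards):
    """Stand-in for the collective that rebuilds a layer's full weights."""
    return [w for piece in shards for w in piece]


def forward(x, sharded_layers):
    """Run the layers in order, materializing only one layer's weights at a time."""
    peak_gathered = 0
    for layer_shards in sharded_layers:
        full_weights = all_gather(layer_shards)         # gather just in time
        peak_gathered = max(peak_gathered, len(full_weights))
        x = [w * xi for w, xi in zip(full_weights, x)]  # toy layer: elementwise scale
        del full_weights                                # discard after use
    return x, peak_gathered


layers = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 0.5, 0.5],
    [2.0, 2.0, 2.0, 2.0],
]
sharded_layers = [shard(weights, world_size=2) for weights in layers]
output, peak = forward([1.0, 1.0, 1.0, 1.0], sharded_layers)
total_params = sum(len(weights) for weights in layers)
```

In this toy run the model has 12 parameters, yet no more than 4 are ever gathered at once, which mirrors how Stage 3 bounds the temporary full-parameter footprint to roughly one layer.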
05

Communication Patterns & Overhead

ZeRO trades increased communication for reduced memory. The overhead is managed through optimized collective operations.

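  • Stage 1/2: Use reduce-scatter and all-gather operations. Total communication volume is about the same as the all-reduce in standard data parallelism, and the overhead is often offset by the ability to use larger batch sizes.
  • Stage 3: Requires all-gather of parameters in both the forward and backward passes plus reduce-scatter of gradients for every layer, raising communication volume to roughly 1.5x that of standard data parallelism. This is mitigated by overlapping communication with computation (communication hiding).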
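A rough per-step volume estimate makes the trade-off concrete. The sketch below follows the commonly cited accounting in which an all-reduce costs about as much as a reduce-scatter followed by an all-gather; it counts elements per GPU, ignores latency and overlap, and the function name `comm_volume_per_step` is hypothetical.

```python
def comm_volume_per_step(num_params, stage):
    """Approximate elements communicated per GPU per training step.

    Stage 0 is standard data parallelism, included as the baseline.
    """
    if stage in (0, 1, 2):
        # Reduce gradients (~1x) plus redistribute results (~1x).
        return 2 * num_params
    if stage == 3:
        # All-gather parameters in forward and in backward, plus reduce-scatter of gradients.
        return 3 * num_params
    raise ValueError("stage must be 0, 1, 2, or 3")


num_params = 1_000_000_000
baseline = comm_volume_per_step(num_params, stage=0)
stage3 = comm_volume_per_step(num_params, stage=3)
stage3_ratio = stage3 / baseline
```

Under these assumptions Stage 3 moves about 1.5 times the data of the baseline, so whether it pays off depends largely on how much of that traffic can be hidden behind computation.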
MEMORY OPTIMIZATION STRATEGIES

ZeRO Stages: Comparison and Trade-offs

A comparison of the three primary ZeRO (Zero Redundancy Optimizer) stages, detailing their memory partitioning strategies, communication overhead, and scalability trade-offs for distributed training of large models.

| Optimized Component | ZeRO-1 (Optimizer State Partitioning) | ZeRO-2 (Gradient Partitioning) | ZeRO-3 (Parameter Partitioning) |
| --- | --- | --- | --- |
| Partitioned Model States | Optimizer States | Optimizer States, Gradients | Optimizer States, Gradients, Parameters |
| Memory Reduction per GPU | Up to 4x | Up to 8x | Nx (linear with # of GPUs) |
| Communication Volume | Low | Moderate | High |
| Communication Overhead Type | Reduce-Scatter or All-Reduce (Gradients), All-Gather (Parameters) | Reduce-Scatter (Gradients), All-Gather (Parameters) | All-Gather (Parameters), Reduce-Scatter (Gradients) |
| Model Size Scalability | Good | Very Good | Excellent |
| Ease of Implementation | High | Moderate | Complex (e.g., DeepSpeed ZeRO-3 or PyTorch FSDP) |
| Activation Memory Optimized | No | No | No |
| Typical Use Case | Moderate model size, bandwidth-constrained clusters | Large models, balanced compute/bandwidth | Extremely large models, memory-bound scenarios |
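The memory figures in the comparison can be reproduced with a short calculation. This sketch assumes mixed-precision Adam training with 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for optimizer states (fp32 weights, momentum, and variance); it covers model states only, not activations or buffers, and `model_state_bytes_per_gpu` is a hypothetical helper name.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Estimate model-state memory per GPU under the assumptions stated above."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params       # fp16 parameters
    grads = 2 * num_params        # fp16 gradients
    optimizer = 12 * num_params   # fp32 parameters, momentum, variance
    if stage >= 1:
        optimizer /= num_gpus     # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus         # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= num_gpus        # ZeRO-3 also shards parameters
    return params + grads + optimizer


num_params = 7_500_000_000  # example model size, chosen for illustration
num_gpus = 64
gigabytes = {
    stage: model_state_bytes_per_gpu(num_params, num_gpus, stage) / 1e9
    for stage in (0, 1, 2, 3)
}
```

For this example the estimate falls from about 120 GB per GPU without ZeRO to roughly 31.4 GB, 16.6 GB, and 1.9 GB for Stages 1, 2, and 3, which approaches the 4x, 8x, and Nx reductions as the GPU count grows.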

DISTRIBUTED TRAINING

ZeRO Implementations and Frameworks

ZeRO is a memory optimization technique for distributed training that partitions model states across GPUs. In practice it is used through frameworks and libraries that build its stages into the training pipeline.

01

ZeRO Stages (0-3)

ZeRO is implemented in progressive stages, each eliminating a different type of memory redundancy. Stage 0, as used in DeepSpeed's configuration, disables ZeRO and corresponds to standard data parallelism.

  • ZeRO-1 (Optimizer State Partitioning): Partitions the optimizer states (e.g., momentum, variance) across processes, reducing optimizer-state memory per GPU by a factor equal to the number of data-parallel workers (N).
  • ZeRO-2 (Adds Gradient Partitioning): Additionally partitions gradients across processes, with each process only updating the parameters for which it holds the gradient.
  • ZeRO-3 (Full Parameter Partitioning): Partitions the model parameters themselves. Parameters are gathered on-demand for the forward and backward passes, enabling the training of models larger than the memory of a single GPU.

Each stage builds upon the last, with ZeRO-3 offering the highest memory savings at the cost of increased communication overhead.
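In DeepSpeed, the stage is typically chosen through the `zero_optimization.stage` field of the JSON configuration. The sketch below builds such a configuration as a Python dictionary; the helper `make_zero_config` and the chosen batch size are assumptions made for illustration, and the full set of supported options should be checked against the DeepSpeed documentation for the version in use.

```python
import json


def make_zero_config(stage, micro_batch_size=4):
    """Build a minimal DeepSpeed-style config dict selecting a ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "fp16": {"enabled": True},               # mixed-precision training
        "zero_optimization": {"stage": stage},   # 0 = off, 1-3 = ZeRO stages
    }


config = make_zero_config(stage=2)
config_json = json.dumps(config, indent=2)  # text that could be saved as a config file
```

Starting at a lower stage and moving up only when memory runs out is a common way to avoid paying for communication that a given model does not need.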

05

ZeRO-Offload & ZeRO-Infinity

These are advanced DeepSpeed extensions that push memory savings beyond GPU RAM limits.

  • ZeRO-Offload: Enables training of large models on a single GPU by offloading optimizer states and gradients to CPU memory and running the optimizer step on the CPU. It keeps parameters on the GPU for the forward and backward computation, using host memory as a large buffer.
  • ZeRO-Infinity: Goes further by offloading to both CPU memory and NVMe storage. It uses techniques such as an infinity offload engine and bandwidth-centric partitioning to make efficient use of terabytes of CPU and NVMe capacity, enabling training of trillion-parameter models.
  • Use Case: Critical for research and organizations without massive GPU clusters, democratizing access to large-scale model training.
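Offloading is usually enabled through the `offload_optimizer` and `offload_param` sections of DeepSpeed's `zero_optimization` configuration. The following sketch shows one plausible shape for each mode; the helper `offload_config` and the path in `NVME_PATH` are hypothetical, and the available keys should be confirmed against the DeepSpeed documentation for the installed version.

```python
NVME_PATH = "/local_nvme"  # hypothetical mount point for fast local storage


def offload_config(target):
    """Return a minimal zero_optimization section for CPU or NVMe offload."""
    if target == "cpu":
        # ZeRO-Offload style: optimizer states kept in host memory.
        return {
            "zero_optimization": {
                "stage": 2,
                "offload_optimizer": {"device": "cpu", "pin_memory": True},
            }
        }
    if target == "nvme":
        # ZeRO-Infinity style: Stage 3 with optimizer states and parameters on NVMe.
        return {
            "zero_optimization": {
                "stage": 3,
                "offload_optimizer": {"device": "nvme", "nvme_path": NVME_PATH},
                "offload_param": {"device": "nvme", "nvme_path": NVME_PATH},
            }
        }
    raise ValueError("target must be 'cpu' or 'nvme'")


cpu_config = offload_config("cpu")
nvme_config = offload_config("nvme")
```

CPU offload is generally the simpler starting point; NVMe offload adds capacity at the cost of depending on storage bandwidth.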
06

Comparison & Trade-offs

Choosing a ZeRO implementation involves trade-offs between memory efficiency, communication cost, and implementation complexity.

  • Memory vs. Communication: ZeRO-1 saves memory with low overhead. ZeRO-3 saves the most memory but introduces significant communication for parameter gathering (all-gather operations).
  • Framework Choice:
    • DeepSpeed: Most feature-rich and battle-tested for extreme scale. Higher configuration complexity.
    • FSDP: PyTorch-native, easier to debug, and aligned with PyTorch's roadmap. Its performance can differ from DeepSpeed's depending on the model and cluster.
    • Hugging Face: Its Transformers and Accelerate libraries integrate with DeepSpeed and FSDP. Best for rapid prototyping and using pre-existing model code.
  • Hybrid Parallelism: In practice, ZeRO (data parallelism) is often combined with Tensor Parallelism (intra-layer) and Pipeline Parallelism (inter-layer) to train the largest models.
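When the three forms of parallelism are combined, the GPUs left over after tensor and pipeline parallelism form the data-parallel group across which ZeRO shards its states. The arithmetic is sketched below; the function name `data_parallel_degree` and the example cluster shape are assumptions chosen for illustration.

```python
def data_parallel_degree(world_size, tensor_parallel, pipeline_parallel):
    """Number of model replicas, which is the group size ZeRO partitions across."""
    model_parallel = tensor_parallel * pipeline_parallel
    if model_parallel <= 0 or world_size % model_parallel != 0:
        raise ValueError("world size must be a positive multiple of TP x PP")
    return world_size // model_parallel


# Example: 64 GPUs, 8-way tensor parallelism within a node, 2 pipeline stages.
dp_degree = data_parallel_degree(world_size=64, tensor_parallel=8, pipeline_parallel=2)
```

In this example ZeRO would partition states across 4 replicas, so its memory savings scale with that data-parallel degree rather than with the full 64 GPUs.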
ZERO REDUNDANCY OPTIMIZER

Frequently Asked Questions

ZeRO is a foundational memory optimization technique for distributed training, enabling the training of models with trillions of parameters by eliminating redundancy across data-parallel processes. These questions address its core mechanisms, stages, and practical implementation.

ZeRO (Zero Redundancy Optimizer) is a memory optimization paradigm for distributed data-parallel training that partitions the three primary model states (optimizer states, gradients, and model parameters) across GPUs to eliminate memory redundancy. It works by sharding these states so each process stores only a unique slice, using collective communication operations (like All-Gather and Reduce-Scatter) to reconstruct the full states only when needed for computation. This partitioning enables the training of models that are significantly larger than the memory of any single device.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.