Glossary

ZeRO (Zero Redundancy Optimizer)

ZeRO is a memory optimization technique for distributed training that partitions model states (optimizer states, gradients, parameters) across data-parallel processes to eliminate memory redundancy.

PARAMETER-EFFICIENT FINE-TUNING

What is ZeRO (Zero Redundancy Optimizer)?

ZeRO is a foundational memory optimization framework for distributed training that enables the training of models with trillions of parameters by eliminating redundant storage across data-parallel processes.

ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for distributed data-parallel training that partitions the three primary model states—optimizer states, gradients, and model parameters—across GPUs instead of replicating them. This systematic elimination of memory redundancy allows the aggregate GPU memory of a cluster to be used as a single, large pool, enabling the training of models far larger than what could fit on any single device. The technique is implemented through progressive stages, ZeRO-1, ZeRO-2, and ZeRO-3, each sharding more state components to achieve greater memory savings at the cost of increased communication overhead.

The core innovation is its sharding strategy. In standard data parallelism, every GPU holds a full copy of the model and its optimizer. ZeRO partitions these states: ZeRO-1 shards only the optimizer states, ZeRO-2 adds gradient partitioning, and ZeRO-3 shards parameters as well. PyTorch's Fully Sharded Data Parallel (FSDP) implements the same idea as ZeRO-3. During the forward and backward passes, parameters are gathered on demand from other devices and released after use, with communication overlapped with computation where possible to hide latency. This makes ZeRO well suited to training very large models, such as Mixture-of-Experts (MoE) and large language models, where memory rather than compute is often the primary constraint.

MEMORY OPTIMIZATION

Key Features of ZeRO

ZeRO (Zero Redundancy Optimizer) is a suite of memory optimization stages for distributed training that partitions model states across data-parallel processes to eliminate memory redundancy.

ZeRO Stage 1: Optimizer State Partitioning

Optimizer states (e.g., momentum, variance for Adam) are partitioned across data-parallel processes. Each process stores and updates only its assigned shard, reducing model-state memory by up to ~4x for mixed-precision Adam when the data-parallel degree is large.

Mechanism: The optimizer is sharded. After the backward pass, each process updates only the parameters covered by its local optimizer-state shard; the updated parameters are then all-gathered so every process again holds the full model.
Impact: Enables training of models up to ~4x larger than standard data parallelism, as optimizer states are a major memory consumer for adaptive optimizers like Adam.
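One way to picture optimizer-state partitioning is a single-process toy simulation in which list indices stand in for ranks. The sketch below is illustrative only: the function names are hypothetical, SGD with momentum replaces Adam for brevity, and the final list concatenation stands in for a real all-gather.

```python
def partition(n_params, world_size):
    """Return [start, end) index ranges, one contiguous shard per rank."""
    base, extra = divmod(n_params, world_size)
    bounds, start = [], 0
    for rank in range(world_size):
        end = start + base + (1 if rank < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def zero1_step(params, grads, momentum_shards, lr=0.1, beta=0.9):
    """One SGD-with-momentum step where each rank owns one optimizer shard.

    params and grads are full-length lists (still replicated in ZeRO-1);
    momentum_shards[rank] holds optimizer state only for that rank's slice.
    """
    bounds = partition(len(params), len(momentum_shards))
    updated_slices = []
    for rank, (start, end) in enumerate(bounds):
        local = []
        for i in range(start, end):
            m = beta * momentum_shards[rank][i - start] + grads[i]
            momentum_shards[rank][i - start] = m
            local.append(params[i] - lr * m)
        updated_slices.append(local)
    # Stand-in for all-gather: every rank receives every updated slice.
    return [p for piece in updated_slices for p in piece]


params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [0.5] * 5
shards = [[0.0] * (end - start) for start, end in partition(5, 2)]
new_params = zero1_step(params, grads, shards)
```

Each simulated rank keeps momentum only for its own slice, so the total optimizer state across ranks equals one copy rather than one copy per rank.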

ZeRO Stage 2: Gradient Partitioning

Gradients are partitioned across processes in addition to optimizer states. After the backward pass, each process only retains the gradients for its assigned parameter partition.

Mechanism: Uses a reduce-scatter operation instead of an all-reduce. Each GPU becomes responsible for a unique subset of gradients.
Impact: Provides up to an additional ~2x memory savings over Stage 1. Total model-state memory reduction approaches ~8x compared to standard data parallelism as the number of GPUs grows, enabling even larger models.
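The difference between all-reduce and reduce-scatter can be shown with a small pure-Python simulation. This is a sketch with hypothetical helper names; real training would use a collective communication library, and the toy version assumes the gradient length divides evenly by the number of ranks.

```python
def all_reduce(per_rank_grads):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]


def reduce_scatter(per_rank_grads):
    """Rank r ends up with only the r-th slice of the element-wise sum."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    assert n % world_size == 0, "toy version assumes an even split"
    chunk = n // world_size
    total = [sum(vals) for vals in zip(*per_rank_grads)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]


# Two simulated ranks, each with its own local gradients.
local_grads = [[1, 2, 3, 4], [10, 20, 30, 40]]
full_copies = all_reduce(local_grads)       # each rank stores 4 values
owned_slices = reduce_scatter(local_grads)  # each rank stores 2 values
```

With reduce-scatter, each rank retains only the reduced gradients for the parameters it is responsible for updating, which is where the gradient memory saving comes from.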

ZeRO Stage 3: Parameter Partitioning

The model parameters themselves are partitioned across GPUs. Each process only stores the parameters for its assigned partition, fetching others on-demand during forward and backward passes.

Mechanism: Parameters are gathered just-in-time for computation and discarded afterward. This introduces communication overhead but maximizes memory savings.
Impact: Enables memory reduction linear with the degree of data parallelism (N). In theory, a model of size M can be trained across N GPUs with ~M/N memory per GPU.
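The per-GPU savings of the three stages can be estimated with simple arithmetic. The sketch below assumes the usual mixed-precision Adam accounting of 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes of optimizer state; the function name is hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_memory_gb(num_params, world_size, stage, k=12):
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    Assumes 2 bytes/param for FP16 weights, 2 for FP16 gradients, and
    k bytes/param of optimizer state (k=12 for mixed-precision Adam).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weight_bytes, grad_bytes, opt_bytes = 2.0, 2.0, float(k)
    if stage >= 1:
        opt_bytes /= world_size      # ZeRO-1: shard optimizer states
    if stage >= 2:
        grad_bytes /= world_size     # ZeRO-2: also shard gradients
    if stage >= 3:
        weight_bytes /= world_size   # ZeRO-3: also shard parameters
    return num_params * (weight_bytes + grad_bytes + opt_bytes) / 1e9


# Example: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 2))
```

As the number of GPUs grows, the Stage 1 figure approaches 4 bytes per parameter and the Stage 2 figure approaches 2 bytes per parameter, against 16 for the unsharded baseline, which is where the "up to 4x" and "up to 8x" figures come from.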

ZeRO-Offload & ZeRO-Infinity

Extensions that leverage CPU and NVMe memory to train models larger than aggregate GPU memory.

ZeRO-Offload: Moves optimizer states and gradients, along with the optimizer update itself, to CPU memory while keeping parameters on the GPU for forward and backward compute. Enables training of 10B+ parameter models on a single GPU.
ZeRO-Infinity: Extends offloading to NVMe storage, enabling the training of trillion-parameter models by using the entire memory hierarchy (GPU → CPU → NVMe).

Communication Patterns & Overhead

ZeRO trades increased communication for reduced memory. The overhead is managed through optimized collective operations.

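Stage 1/2: Typically reduce gradients with reduce-scatter and redistribute updated parameters with all-gather. Together these move about the same volume of data as the single all-reduce in standard data parallelism, so the added overhead is small.
Stage 3: Requires an all-gather of parameters for each layer in both the forward and backward passes, plus a reduce-scatter of gradients, for roughly 1.5x the communication volume of standard data parallelism. This is mitigated by overlapping communication with computation (communication hiding).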
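A back-of-the-envelope count makes the communication trade-off concrete. The sketch below uses the common large-cluster approximation that a reduce-scatter or an all-gather each moves about one model's worth of elements per GPU, and that an all-reduce is equivalent to one of each. It assumes Stage 1 reduces gradients with reduce-scatter; the function name is hypothetical.

```python
def comm_volume_per_step(num_params, stage):
    """Approximate elements communicated per GPU per training step."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    reduce_scatter = num_params  # reduce gradients, each rank keeps a slice
    all_gather = num_params      # collect a full copy from all slices
    if stage < 3:
        # Stage 0: all-reduce = reduce-scatter + all-gather of gradients.
        # Stages 1-2: reduce-scatter gradients, all-gather updated parameters.
        return reduce_scatter + all_gather
    # Stage 3: all-gather parameters in forward and again in backward,
    # then reduce-scatter gradients.
    return 2 * all_gather + reduce_scatter


baseline = comm_volume_per_step(1_000_000, 0)
stage3 = comm_volume_per_step(1_000_000, 3)
ratio = stage3 / baseline
```

Under these assumptions Stage 3 moves about 1.5 times the data of plain data parallelism, while Stages 1 and 2 match the baseline.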

Fully Sharded Data Parallel (FSDP)

FSDP is the PyTorch-native implementation of ZeRO-3. It simplifies adoption by integrating deeply with PyTorch's autograd and module system.

Key Features: Automates sharding, gradient reduction, and parameter gathering. Supports flexible sharding strategies (full sharding, gradient-and-optimizer sharding via SHARD_GRAD_OP, and hybrid sharding).
Use Case: A common choice for training large models in PyTorch, used in place of DistributedDataParallel (DDP) when fully replicated model states do not fit in GPU memory.

MEMORY OPTIMIZATION STRATEGIES

ZeRO Stages: Comparison and Trade-offs

A comparison of the three primary ZeRO (Zero Redundancy Optimizer) stages, detailing their memory partitioning strategies, communication overhead, and scalability trade-offs for distributed training of large models.

Property	ZeRO-1 (Optimizer State Partitioning)	ZeRO-2 (Gradient Partitioning)	ZeRO-3 (Parameter Partitioning)
Partitioned Model States	Optimizer States	Optimizer States, Gradients	Optimizer States, Gradients, Parameters
Memory Reduction per GPU	Up to ~4x	Up to ~8x	Nx (linear in number of GPUs)
Communication Volume	About the same as standard data parallelism	About the same as standard data parallelism	~1.5x standard data parallelism
Communication Overhead Type	Reduce-Scatter or All-Reduce (Gradients), All-Gather (Updated Parameters)	Reduce-Scatter (Gradients), All-Gather (Updated Parameters)	All-Gather (Parameters), Reduce-Scatter (Gradients)
Model Size Scalability	Good	Very Good	Excellent
Ease of Implementation	High	Moderate	Complex (e.g., DeepSpeed stage 3 or FSDP)
Activation Memory Optimized	No	No	No
Typical Use Case	Moderate model size, bandwidth-constrained clusters	Large models, balanced compute/bandwidth	Extremely large models, memory-bound scenarios

DISTRIBUTED TRAINING

ZeRO Implementations and Frameworks

ZeRO is a memory optimization technique for distributed training that partitions model states across GPUs. Its implementations are realized through specific frameworks and libraries that integrate its stages into the training pipeline.

ZeRO Stages (0-3)

ZeRO is implemented in progressive stages, each eliminating a different type of memory redundancy. Stage 0 disables partitioning and corresponds to standard data parallelism.

ZeRO-1 (Optimizer State Partitioning): Partitions the optimizer states (e.g., momentum, variance) across processes, reducing optimizer-state memory per process by a factor equal to the number of data-parallel workers (D).
ZeRO-2 (Add Gradient Partitioning): Additionally partitions gradients across processes, with each process only updating the parameters for which it holds the gradient.
ZeRO-3 (Full Parameter Partitioning): Partitions the model parameters themselves. Parameters are gathered on-demand for the forward and backward passes, enabling the training of models larger than the memory of a single GPU.

Each stage builds upon the last, with ZeRO-3 offering the highest memory savings at the cost of increased communication overhead.

DeepSpeed

DeepSpeed is a deep learning optimization library from Microsoft that provides the canonical, production-ready implementation of ZeRO. It is a foundational framework for large-scale model training.

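Core Integration: DeepSpeed wraps a PyTorch model through deepspeed.initialize, and the ZeRO stage (0 to 3) is selected with the stage field in the zero_optimization section of its configuration. (ZeroRedundancyOptimizer is a separate PyTorch class, torch.distributed.optim.ZeroRedundancyOptimizer, that shards optimizer states only.)
Additional Optimizations: It complements ZeRO with features like ZeRO-Offload (to CPU memory), ZeRO-Infinity (to NVMe storage), and overlapping of communication with computation.
Widespread Adoption: Used to train models like Megatron-Turing NLG 530B (MT-NLG) and BLOOM. It can also serve as the backend for higher-level frameworks like Hugging Face's Trainer.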
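As a rough illustration of how a ZeRO stage is selected in DeepSpeed, the sketch below builds a minimal configuration dictionary. The helper name and the chosen values are assumptions for illustration; the keys shown (train_micro_batch_size_per_gpu, bf16, zero_optimization with stage, overlap_comm, contiguous_gradients) are standard DeepSpeed config fields, but a real job would need further settings tuned to the model and cluster.

```python
import json


def make_zero_config(stage, micro_batch_size=1):
    """Build a minimal DeepSpeed-style config dict for a given ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": stage,
            "overlap_comm": True,          # overlap reduction with backward
            "contiguous_gradients": True,  # reduce memory fragmentation
        },
    }


# Serialize to JSON, e.g. to save as a config file for a launcher.
config_json = json.dumps(make_zero_config(stage=2), indent=2)

# In a real training script the dict (or a path to the JSON file) would
# typically be passed as the `config` argument of deepspeed.initialize,
# together with the model and its parameters.
```

Keeping the configuration in a plain dictionary makes it easy to switch stages during experiments without touching the training loop.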

Fully Sharded Data Parallel (FSDP)

Fully Sharded Data Parallel is PyTorch's native implementation of the ZeRO-3 paradigm, introduced in PyTorch 1.11. It is designed for seamless integration within the PyTorch ecosystem.

PyTorch Native: Implemented as torch.distributed.fsdp.FullyShardedDataParallel. It shards parameters, gradients, and optimizer states.
Flexible Sharding: Supports multiple sharding strategies (e.g., FULL_SHARD, SHARD_GRAD_OP, NO_SHARD), plus hybrid sharding (HYBRID_SHARD), which shards within a node and replicates across nodes.
Automatic Integration: Works with standard PyTorch modules, torch.compile, and custom autograd functions, making it easier to adopt for existing codebases.
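The sharding strategies listed above correspond loosely to ZeRO stages. The sketch below encodes one commonly used mapping and shows how a model might be wrapped; the helper names are hypothetical, the mapping is an approximation (ZeRO-1 has no direct FSDP counterpart), and wrap_with_fsdp assumes PyTorch is installed and a process group has already been initialized, for example by launching with torchrun.

```python
# Approximate correspondence between ZeRO stages and FSDP strategy names.
_STAGE_TO_STRATEGY = {0: "NO_SHARD", 2: "SHARD_GRAD_OP", 3: "FULL_SHARD"}


def strategy_name_for_stage(stage):
    """Return the FSDP ShardingStrategy name closest to a ZeRO stage."""
    try:
        return _STAGE_TO_STRATEGY[stage]
    except KeyError:
        raise ValueError(
            f"no direct FSDP counterpart for ZeRO stage {stage}"
        ) from None


def wrap_with_fsdp(model, stage=3):
    """Wrap a torch.nn.Module with FSDP using the mapped strategy.

    Imports are local so this file can be read without PyTorch installed.
    """
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import ShardingStrategy

    strategy = getattr(ShardingStrategy, strategy_name_for_stage(stage))
    return FSDP(model, sharding_strategy=strategy)
```

In practice, wrapping policies and mixed-precision settings usually matter as much as the sharding strategy, so this should be read as a starting point rather than a complete setup.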

Hugging Face Accelerate & Trainer

Hugging Face provides high-level abstractions that integrate ZeRO implementations, simplifying distributed training for practitioners.

Accelerate Library: Offers a unified API for launching training across various setups (single GPU, multi-GPU, TPU). It can leverage DeepSpeed's ZeRO configs with minimal code changes via accelerate config.
Trainer Integration: The Trainer class from transformers natively supports DeepSpeed configuration files. Users can enable ZeRO by simply providing a deepspeed config JSON file, abstracting away the complex distributed code.
Accessibility: This makes advanced memory optimization techniques accessible without requiring deep expertise in distributed systems programming.

ZeRO-Offload & ZeRO-Infinity

These are advanced DeepSpeed extensions that push memory savings beyond GPU RAM limits.

ZeRO-Offload: Enables training of large models on a single GPU by offloading optimizer states and gradients to CPU memory. It strategically keeps parameters on GPU for compute, using the CPU as a massive memory buffer.
ZeRO-Infinity: Goes further by offloading to both CPU memory and NVMe storage. It uses techniques such as an infinity offload engine and bandwidth-centric partitioning to make efficient use of terabytes of CPU and NVMe capacity, enabling training of trillion-parameter models.
Use Case: Critical for research and organizations without massive GPU clusters, democratizing access to large-scale model training.
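Offloading is usually switched on through the same DeepSpeed configuration that selects the ZeRO stage. The sketch below is a minimal example under stated assumptions: the helper name is hypothetical, the NVMe path is a placeholder, and only the offload-related fields (offload_optimizer and offload_param with device, pin_memory, and nvme_path) are shown.

```python
def make_offload_config(nvme_path=None):
    """ZeRO-3 config that offloads to CPU, or to NVMe if a path is given."""
    if nvme_path is None:
        # ZeRO-Offload style: keep offloaded states in pinned CPU memory.
        target = {"device": "cpu", "pin_memory": True}
    else:
        # ZeRO-Infinity style: spill offloaded states to NVMe storage.
        target = {"device": "nvme", "nvme_path": nvme_path}
    return {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": dict(target),
            "offload_param": dict(target),
        },
    }


cpu_cfg = make_offload_config()
# "/local_nvme" is a placeholder mount point, not a required location.
nvme_cfg = make_offload_config(nvme_path="/local_nvme")
```

Offloading trades GPU memory for transfer time over PCIe or storage links, so throughput depends heavily on host memory bandwidth and NVMe speed.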

Comparison & Trade-offs

Choosing a ZeRO implementation involves trade-offs between memory efficiency, communication cost, and implementation complexity.

Memory vs. Communication: ZeRO-1 saves memory with low overhead. ZeRO-3 saves the most memory but introduces significant communication for parameter gathering (all-gather operations).
Framework Choice:
- DeepSpeed: Most feature-rich and battle-tested for extreme scale. Higher configuration complexity.
- FSDP: PyTorch-native, easier debugging, and integrates with PyTorch's future roadmap. Performance relative to DeepSpeed varies by model and cluster.
- Hugging Face: Best for rapid prototyping and using pre-existing model code.
Hybrid Parallelism: In practice, ZeRO (data parallelism) is often combined with Tensor Parallelism (intra-layer) and Pipeline Parallelism (inter-layer) to train the largest models.

ZERO REDUNDANCY OPTIMIZER

Frequently Asked Questions

ZeRO is a foundational memory optimization technique for distributed training, enabling the training of models with trillions of parameters by eliminating redundancy across data-parallel processes. These questions address its core mechanisms, stages, and practical implementation.

ZeRO (Zero Redundancy Optimizer) is a memory optimization paradigm for distributed data-parallel training that partitions the three primary model states—optimizer states, gradients, and model parameters—across GPUs to eliminate memory redundancy. It works by sharding these states so each processor only stores a unique slice, using collective communication operations (like All-Gather and Reduce-Scatter) to reconstruct the full states only when needed for computation. This partitioning enables the training of models that are significantly larger than the memory of any single device.

MEMORY & DISTRIBUTED TRAINING

Related Terms

ZeRO is a cornerstone of large-scale model training. These related concepts are essential for understanding the broader ecosystem of memory optimization and distributed computation.

Fully Sharded Data Parallel (FSDP)

Fully Sharded Data Parallel is the PyTorch-native implementation of the ZeRO-3 optimization stage. It shards all model components—parameters, gradients, and optimizer states—across all data-parallel processes. Unlike classic data parallelism, which replicates the entire model, FSDP eliminates memory redundancy, enabling the training of models that far exceed the memory of a single GPU.

Core Mechanism: Each GPU stores only a shard of the full model. During the forward pass, it gathers the necessary parameters from other GPUs, then discards them after computation.
Integration: Deeply integrated into PyTorch's torch.distributed module, offering a more user-friendly API than manual ZeRO implementations.
Use Case: A widely used method for training multi-billion-parameter models, supported by tools such as Hugging Face Transformers and used in Meta's Llama training code.

Gradient Checkpointing

Gradient checkpointing (or activation checkpointing) is a memory optimization technique that trades compute for memory. It reduces the memory cost of storing intermediate activations during the forward pass, which are needed to compute gradients during the backward pass.

How it Works: Instead of saving all activations, the system selectively saves only a subset (checkpoints). During the backward pass, it recomputes the non-saved activations on-the-fly from the nearest checkpoint.
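Memory/Compute Trade-off: Can reduce activation memory by up to about 80%, at the cost of recomputing parts of the forward pass, which typically increases training time by around 25%. Actual figures depend on the model and on where checkpoints are placed.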
Synergy with ZeRO: Often used in conjunction with ZeRO/FSDP. While ZeRO optimizes memory for parameters and optimizer states, checkpointing optimizes memory for activations, providing a comprehensive memory reduction strategy.
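The recompute-instead-of-store idea can be demonstrated on a toy chain of scalar layers. The sketch below is purely illustrative: the layer function, helper names, and the way stored activations are counted are assumptions chosen for clarity, not how any particular framework implements checkpointing.

```python
import math


def forward_layer(w, x):
    return math.tanh(w * x)


def local_grad(w, y):
    """d tanh(w*x)/dx, expressed through the layer output y."""
    return w * (1.0 - y * y)


def grad_full(weights, x0):
    """Store every activation, then backpropagate through the chain."""
    acts = [x0]
    for w in weights:
        acts.append(forward_layer(w, acts[-1]))
    g = 1.0
    for i in reversed(range(len(weights))):
        g *= local_grad(weights[i], acts[i + 1])
    return g, len(acts)


def grad_checkpointed(weights, x0, every):
    """Store only every `every`-th activation and recompute the rest."""
    n = len(weights)
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(weights):
        x = forward_layer(w, x)
        if (i + 1) % every == 0 and (i + 1) < n:
            checkpoints[i + 1] = x
    g, peak, end = 1.0, len(checkpoints), n
    for start in sorted(checkpoints, reverse=True):
        # Recompute this segment's activations from its checkpoint.
        seg = [checkpoints[start]]
        for i in range(start, end):
            seg.append(forward_layer(weights[i], seg[-1]))
        peak = max(peak, len(checkpoints) + len(seg) - 1)
        for i in reversed(range(start, end)):
            g *= local_grad(weights[i], seg[i - start + 1])
        end = start
    return g, peak


weights = [0.5, -1.2, 0.8, 1.1, -0.7, 0.9, 1.3, -0.4]
g_full, stored_full = grad_full(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, every=4)
```

Both functions return the same gradient, but the checkpointed version holds fewer activations at its peak in exchange for running each layer's forward computation a second time.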

Mixed Precision Training

Mixed precision training uses lower numerical precision (e.g., BF16 or FP16) for most operations to accelerate computation and reduce memory usage, while maintaining higher precision (FP32) for a master copy of weights and critical operations to ensure numerical stability.

Key Components:
- FP16/BF16 for Forward/Backward Pass: Faster computation and halved memory for activations and gradients.
- FP32 Master Weights & Optimizer States: Maintains a full-precision copy of weights to avoid underflow/overflow during gradient updates.
- Loss Scaling: Scales the loss so that small gradient values do not underflow to zero in FP16's narrow range. BF16 has the same exponent range as FP32 and usually does not need it.
Integration with ZeRO: ZeRO's optimizer state partitioning is especially powerful here, as the FP32 master states are a major memory bottleneck. Combining mixed precision with ZeRO-2 or ZeRO-3 is standard for modern large model training.
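The effect of loss scaling can be seen with a small standard-library experiment that rounds values to half precision. This is a sketch only: the helper names and the scale of 1024 are illustrative choices, and Python's 64-bit floats stand in for the higher-precision side of the computation.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_grad_with_scaling(grad, loss_scale):
    """Simulate a gradient that passes through FP16 with loss scaling."""
    scaled = to_fp16(grad * loss_scale)  # value held by the FP16 backward pass
    return scaled / loss_scale           # unscale in higher precision


tiny_grad = 1e-8
without_scaling = to_fp16(tiny_grad)
with_scaling = fp16_grad_with_scaling(tiny_grad, loss_scale=1024.0)
```

Without scaling, the small gradient is flushed to zero in FP16; with scaling it survives the round trip with a small rounding error, which is why FP16 training commonly pairs loss scaling with FP32 master weights.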

Model Parallelism

Model parallelism is a distributed training strategy that partitions a single model's layers or components across multiple devices. This contrasts with data parallelism (which ZeRO enhances) where the model is replicated and data is split.

Tensor Parallelism: Splits individual layer operations (e.g., the matrix multiplications in an MLP block or across attention heads) across GPUs. Popularized by Megatron-LM.
Pipeline Parallelism: Places different groups of model layers on different GPUs. Micro-batching is used to keep the pipeline full and avoid idle devices.
Relationship to ZeRO: ZeRO is a form of optimized data parallelism. In practice, the largest models use 3D parallelism: a combination of ZeRO/FSDP (Data Parallelism), Tensor Parallelism, and Pipeline Parallelism to distribute both the model and the data.

ZeRO Stages (1, 2, 3)

ZeRO operates in progressive optimization stages, each eliminating a different type of memory redundancy.

ZeRO-1 (Optimizer State Partitioning): Partitions only the optimizer states (e.g., momentum, variance) across processes. Reduces memory proportional to the data parallelism degree.
ZeRO-2 (Add Gradient Partitioning): Adds partitioning of gradients in addition to optimizer states. Further reduces memory and enables larger batch sizes.
ZeRO-3 (Add Parameter Partitioning): Partitions the model parameters themselves across processes. This is the most memory-efficient stage, enabling model sizes that exceed the memory of any single GPU. Parameters are gathered on-demand for computation.
ZeRO-Offload & ZeRO-Infinity: Advanced variants that offload partitioned states to CPU RAM (Offload) or NVMe storage (Infinity), pushing the boundaries of trainable model size.

Distributed Data Parallel (DDP)

Distributed Data Parallel is the classic, non-sharded data parallelism framework in PyTorch. It serves as the baseline that ZeRO optimizes.

Baseline Mechanism: Each GPU holds a full replica of the model, optimizer, and its states. Each processes a different shard of the data batch.
- After the backward pass, gradients are averaged across all processes via an all-reduce operation.
- Each GPU then performs an identical weight update.
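The Redundancy Problem: This full replication is the source of memory redundancy. With mixed-precision Adam, each parameter costs about 2 bytes (FP16 weights) + 2 bytes (FP16 gradients) + 12 bytes (FP32 master weights, momentum, and variance) = 16 bytes, so a 10B-parameter model needs roughly 10B × 16 bytes ≈ 160 GB per GPU for parameters, gradients, and optimizer states alone, before activations.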
ZeRO as an Optimization: ZeRO can be viewed as a memory-optimized version of DDP. It removes the replication of optimizer states, gradients, and parameters, transforming DDP from a memory-redundant to a memory-efficient paradigm.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
