ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for distributed data-parallel training that partitions the three primary model states—optimizer states, gradients, and model parameters—across GPUs instead of replicating them. This systematic elimination of memory redundancy allows the aggregate GPU memory of a cluster to be used as a single, large pool, enabling the training of models far larger than what could fit on any single device. The technique is implemented through progressive stages, ZeRO-1, ZeRO-2, and ZeRO-3, each sharding more state components to achieve greater memory savings at the cost of increased communication overhead.
ZeRO (Zero Redundancy Optimizer)

What is ZeRO (Zero Redundancy Optimizer)?
ZeRO is a foundational memory optimization framework for distributed training that enables the training of models with trillions of parameters by eliminating redundant storage across data-parallel processes.
The core innovation is its sharding strategy. In standard data parallelism, every GPU holds a full copy of the model and its optimizer states. ZeRO partitions these states: ZeRO-1 shards only the optimizer states, ZeRO-2 adds gradient partitioning, and ZeRO-3 shards the parameters as well. PyTorch's Fully Sharded Data Parallel (FSDP) implements the same idea as ZeRO-3 but is a separate implementation, not another name for it. In ZeRO-3, parameters are gathered on demand from across devices during the forward and backward passes and then re-sharded, with communication overlapped with computation to hide latency. This makes ZeRO essential for training massive models, such as large language models and Mixture-of-Experts (MoE) models, where memory rather than compute is the primary constraint.
Key Features of ZeRO
ZeRO (Zero Redundancy Optimizer) is a suite of memory optimization stages for distributed training that partitions model states across data-parallel processes to eliminate memory redundancy.
ZeRO Stage 1: Optimizer State Partitioning
Optimizer states (e.g., momentum and variance for Adam) are partitioned across data-parallel processes. Each process stores and updates only its assigned shard, reducing the per-GPU memory footprint of model states by up to ~4x with mixed-precision Adam.
- Mechanism: The optimizer state is sharded. After the backward pass, each process updates only the parameters covered by its optimizer-state shard, and an `all-gather` then distributes the updated parameters to every process.
- Impact: Enables training of models ~4x larger than standard data parallelism, as optimizer states are a major memory consumer for adaptive optimizers like Adam.
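As a rough illustration of optimizer state partitioning, the following single-process sketch simulates two "ranks" as list indices. It assumes SGD with momentum instead of Adam to keep the arithmetic short, and the helper names (`partition`, `zero1_step`, `baseline_step`) are hypothetical, not part of any library. It shows that updating each slice with only the local optimizer shard, then gathering the slices, matches the unsharded update.

```python
# Toy single-process simulation of ZeRO-1 (no real communication happens).

def partition(n_items, world_size):
    """Return one contiguous (start, end) index range per rank."""
    base, extra = divmod(n_items, world_size)
    bounds, start = [], 0
    for rank in range(world_size):
        end = start + base + (1 if rank < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def zero1_step(params, grads, momentum_shards, bounds, lr=0.1, beta=0.9):
    """Each rank updates only the parameter slice whose optimizer state it owns."""
    updated_slices = []
    for rank, (start, end) in enumerate(bounds):
        m = momentum_shards[rank]  # this rank's optimizer state only
        new_slice = []
        for i in range(start, end):
            m[i - start] = beta * m[i - start] + grads[i]
            new_slice.append(params[i] - lr * m[i - start])
        updated_slices.append(new_slice)
    # Simulated all-gather: every rank receives every updated slice.
    return [p for s in updated_slices for p in s]

def baseline_step(params, grads, momentum, lr=0.1, beta=0.9):
    """Unsharded reference update with the full optimizer state."""
    for i in range(len(params)):
        momentum[i] = beta * momentum[i] + grads[i]
    return [p - lr * m for p, m in zip(params, momentum)]

params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [0.5, -0.5, 1.0, 0.0, 2.0]  # assumed already averaged across ranks
bounds = partition(len(params), world_size=2)
momentum_shards = [[0.0] * (end - start) for start, end in bounds]

sharded = zero1_step(params, grads, momentum_shards, bounds)
reference = baseline_step(params, grads, [0.0] * len(params))
print(sharded == reference)
```

Each simulated rank holds only 3 or 2 momentum values instead of all 5, which is where the memory saving comes from.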
ZeRO Stage 2: Gradient Partitioning
Gradients are partitioned across processes in addition to optimizer states. After the backward pass, each process only retains the gradients for its assigned parameter partition.
- Mechanism: Uses a `reduce-scatter` operation instead of an `all-reduce`. Each GPU becomes responsible for a unique subset of gradients.
- Impact: Provides an additional ~2x memory savings over Stage 1. Total memory reduction is ~8x compared to standard data parallelism, enabling even larger models.
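The difference between the two collectives can be sketched in plain Python. This is a toy model that assumes the gradient length divides evenly by the number of ranks; the function names are illustrative and are not the `torch.distributed` API.

```python
def all_reduce(per_rank_grads):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]

def reduce_scatter(per_rank_grads):
    """Rank r ends up with only the r-th equal-sized chunk of the sum."""
    world_size = len(per_rank_grads)
    total = [sum(vals) for vals in zip(*per_rank_grads)]
    chunk = len(total) // world_size  # assumes even divisibility
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]

grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0's local gradients
    [10.0, 20.0, 30.0, 40.0],  # rank 1's local gradients
]
full = all_reduce(grads)        # each rank stores 4 reduced values
shards = reduce_scatter(grads)  # each rank stores 2 reduced values
print(full)
print(shards)
```

With `all-reduce`, every rank keeps all four reduced gradients. With `reduce-scatter`, each rank keeps only the half it is responsible for updating.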
ZeRO Stage 3: Parameter Partitioning
The model parameters themselves are partitioned across GPUs. Each process only stores the parameters for its assigned partition, fetching others on-demand during forward and backward passes.
- Mechanism: Parameters are gathered just-in-time for computation and discarded afterward. This introduces communication overhead but maximizes memory savings.
- Impact: Enables memory reduction linear with the degree of data parallelism (N). In theory, a model of size `M` can be trained across `N` GPUs with ~`M/N` memory per GPU.
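The just-in-time gathering described above can be sketched as a toy forward pass. Everything here is hypothetical scaffolding: `ShardedLayer` is an illustrative class, the "layer math" is a stand-in, and the sketch only counts how many parameter values are resident on one rank at a time.

```python
class ShardedLayer:
    """A 'layer' whose weight vector is split evenly across ranks."""

    def __init__(self, weights, world_size):
        chunk = len(weights) // world_size  # assumes even divisibility
        self.shards = [weights[r * chunk:(r + 1) * chunk] for r in range(world_size)]

    def all_gather(self):
        """Simulated all-gather: rebuild the full weight vector."""
        return [w for shard in self.shards for w in shard]

def forward(layers, x, rank=0):
    """Run a toy forward pass and track peak parameters resident on `rank`."""
    shard_resident = sum(len(layer.shards[rank]) for layer in layers)
    peak = shard_resident
    for layer in layers:
        full = layer.all_gather()  # materialize the layer just in time
        # The rank's own shard is already counted in shard_resident.
        peak = max(peak, shard_resident + len(full) - len(layer.shards[rank]))
        x = sum(w * x for w in full)  # stand-in for real layer computation
        del full  # release the gathered copy before the next layer
    return x, peak

layers = [ShardedLayer([1.0] * 8, world_size=4), ShardedLayer([0.5] * 8, world_size=4)]
total_params = 16
output, peak_resident = forward(layers, x=2.0)
print(output, peak_resident, total_params)
```

In this sketch the rank never holds more than its own shards plus one fully gathered layer, so peak residency stays below the total parameter count.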
Communication Patterns & Overhead
ZeRO trades increased communication for reduced memory. The overhead is managed through optimized collective operations.
- Stage 1/2: Use `reduce-scatter` and `all-gather` operations. Overhead is moderate and often offset by the ability to use larger batch sizes.
- Stage 3: Requires frequent `all-gather` and `reduce-scatter` for every layer, increasing communication volume. This is mitigated by overlapping communication with computation (communication hiding).
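A back-of-the-envelope estimate of per-step communication can make the trade-off concrete. The sketch below uses the approximation from the ZeRO paper's analysis, counted in elements moved per GPU for a model with `num_params` parameters: about 2x the parameter count for standard data parallelism and Stages 1 and 2, and about 3x for Stage 3. The function name is hypothetical, and real overhead also depends on bandwidth, bucket sizes, and overlap.

```python
def comm_volume_per_step(num_params, stage):
    """Approximate elements communicated per GPU per training step."""
    if stage in (0, 1, 2):
        # One reduce-scatter of gradients plus one all-gather
        # (of reduced gradients or of updated parameters).
        return 2 * num_params
    if stage == 3:
        # All-gather parameters for forward, again for backward,
        # plus one reduce-scatter of gradients.
        return 3 * num_params
    raise ValueError(f"unknown ZeRO stage: {stage}")

baseline = comm_volume_per_step(1_000_000, stage=0)
stage3 = comm_volume_per_step(1_000_000, stage=3)
print(stage3 / baseline)
```

Under these assumptions Stage 3 moves about 1.5x the data of standard data parallelism, while Stages 1 and 2 match the baseline.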
ZeRO Stages: Comparison and Trade-offs
A comparison of the three primary ZeRO (Zero Redundancy Optimizer) stages, detailing their memory partitioning strategies, communication overhead, and scalability trade-offs for distributed training of large models.
| Property | ZeRO-1 (Optimizer State Partitioning) | ZeRO-2 (Gradient Partitioning) | ZeRO-3 (Parameter Partitioning) |
|---|---|---|---|
| Partitioned Model States | Optimizer States | Optimizer States, Gradients | Optimizer States, Gradients, Parameters |
| Memory Reduction per GPU (mixed-precision Adam) | Up to ~4x | Up to ~8x | Nx (linear with number of GPUs) |
| Communication Volume | About the same as standard data parallelism | About the same as standard data parallelism | About 1.5x standard data parallelism |
| Communication Operations | All-Reduce or Reduce-Scatter (Gradients), All-Gather (Parameters) | Reduce-Scatter (Gradients), All-Gather (Parameters) | All-Gather (Parameters), Reduce-Scatter (Gradients) |
| Model Size Scalability | Good | Very Good | Excellent |
| Ease of Implementation | High | Moderate | Low (most complex to configure and tune) |
| Activation Memory Optimized | No | No | No |
| Typical Use Case | Moderate model size, bandwidth-constrained clusters | Large models, balanced compute/bandwidth | Extremely large models, memory-bound scenarios |
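The memory figures in the table can be reproduced with a short calculation. This sketch assumes mixed-precision Adam, with 2 bytes per parameter for the half-precision weights, 2 bytes for gradients, and 12 bytes of optimizer state; `model_state_gb` is a hypothetical helper, and activations, buffers, and fragmentation are ignored.

```python
def model_state_gb(num_params, world_size, stage, optimizer_bytes=12):
    """Approximate per-GPU memory (GB) for parameters, gradients, and optimizer states."""
    params = 2 * num_params              # half-precision weights
    grads = 2 * num_params               # half-precision gradients
    optim = optimizer_bytes * num_params # FP32 master weights, momentum, variance
    if stage >= 1:
        optim /= world_size
    if stage >= 2:
        grads /= world_size
    if stage >= 3:
        params /= world_size
    return (params + grads + optim) / 1e9

# 7.5B parameters on 64 GPUs
for stage in range(4):
    print(f"stage {stage}: {model_state_gb(7.5e9, 64, stage):.1f} GB")
# stage 0: 120.0 GB
# stage 1: 31.4 GB
# stage 2: 16.6 GB
# stage 3: 1.9 GB
```

The ~4x and ~8x figures are the limits for large GPU counts: as the sharded terms shrink, Stage 1 approaches 4 of the original 16 bytes per parameter and Stage 2 approaches 2.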
ZeRO Implementations and Frameworks
ZeRO is a memory optimization technique for distributed training that partitions model states across GPUs. Its implementations are realized through specific frameworks and libraries that integrate its stages into the training pipeline.
ZeRO Stages (1-3)
ZeRO is implemented in progressive stages, each eliminating a different type of memory redundancy.
- ZeRO-1 (Optimizer State Partitioning): Partitions the optimizer states (e.g., momentum, variance) across processes, reducing optimizer-state memory per GPU by a factor equal to the number of data-parallel workers (D).
- ZeRO-2 (Add Gradient Partitioning): Additionally partitions gradients across processes, with each process only updating the parameters for which it holds the gradient.
- ZeRO-3 (Full Parameter Partitioning): Partitions the model parameters themselves. Parameters are gathered on-demand for the forward and backward passes, enabling the training of models larger than the memory of a single GPU.
Each stage builds upon the last, with ZeRO-3 offering the highest memory savings at the cost of increased communication overhead.
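In DeepSpeed, the stage is selected in the configuration under `zero_optimization`. The fragment below is a minimal sketch written as a Python dictionary; the batch size and flag values are placeholder assumptions, not tuned recommendations, and a real configuration usually needs more fields.

```python
import json

# Minimal DeepSpeed-style configuration selecting ZeRO Stage 2.
# All values are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # 1, 2, or 3 (0 disables ZeRO)
        "overlap_comm": True,          # overlap gradient reduction with backward
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}

print(json.dumps(ds_config, indent=2))
```

Changing `"stage"` to 1 or 3 switches between the stages described above, so it is common to start with the lowest stage that fits in memory.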
ZeRO-Offload & ZeRO-Infinity
These are advanced DeepSpeed extensions that push memory savings beyond GPU RAM limits.
- ZeRO-Offload: Enables training of large models on a single GPU by offloading optimizer states and gradients to CPU memory. It strategically keeps parameters on GPU for compute, using the CPU as a massive memory buffer.
- ZeRO-Infinity: Goes further by offloading to both CPU memory and NVMe storage. It uses techniques such as the infinity offload engine and bandwidth-centric partitioning to make efficient use of terabytes of CPU and NVMe memory, enabling training of trillion-parameter models.
- Use Case: Critical for research and organizations without massive GPU clusters, democratizing access to large-scale model training.
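Offloading is also driven by the DeepSpeed configuration, through the `offload_optimizer` and `offload_param` sections of `zero_optimization`. The helper below is a hypothetical convenience function, and the NVMe path is a placeholder. It assumes Stage 2 for optimizer-only offload and Stage 3 whenever parameters are offloaded too.

```python
def zero_offload_config(param_device="none", nvme_path=None):
    """Build a ZeRO config fragment that offloads optimizer states to CPU.

    `param_device` may be "none", "cpu", or "nvme"; "nvme" needs `nvme_path`.
    """
    zero = {
        # Parameter offload is a Stage 3 feature; otherwise Stage 2 suffices.
        "stage": 3 if param_device != "none" else 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    }
    if param_device != "none":
        offload_param = {"device": param_device, "pin_memory": True}
        if param_device == "nvme":
            if nvme_path is None:
                raise ValueError("nvme offload requires nvme_path")
            offload_param["nvme_path"] = nvme_path
        zero["offload_param"] = offload_param
    return {"zero_optimization": zero}

offload_cfg = zero_offload_config()  # optimizer states on CPU
infinity_cfg = zero_offload_config("nvme", nvme_path="/local_nvme")  # placeholder path
print(offload_cfg)
print(infinity_cfg)
```

The first configuration corresponds roughly to the ZeRO-Offload setup and the second to a ZeRO-Infinity style setup with parameters on NVMe.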
Comparison & Trade-offs
Choosing a ZeRO implementation involves trade-offs between memory efficiency, communication cost, and implementation complexity.
- Memory vs. Communication: ZeRO-1 saves memory with low overhead. ZeRO-3 saves the most memory but introduces significant communication for parameter gathering (`all-gather` operations).
- Framework Choice:
  - DeepSpeed: Most feature-rich and battle-tested for extreme scale. Higher configuration complexity.
  - FSDP: PyTorch-native, easier debugging, and integrates with PyTorch's future roadmap. May have different performance characteristics.
  - Hugging Face: Best for rapid prototyping and using pre-existing model code.
- Hybrid Parallelism: In practice, ZeRO (data parallelism) is often combined with Tensor Parallelism (intra-layer) and Pipeline Parallelism (inter-layer) to train the largest models.
Frequently Asked Questions
ZeRO is a foundational memory optimization technique for distributed training, enabling the training of models with trillions of parameters by eliminating redundancy across data-parallel processes. These questions address its core mechanisms, stages, and practical implementation.
ZeRO (Zero Redundancy Optimizer) is a memory optimization paradigm for distributed data-parallel training that partitions the three primary model states—optimizer states, gradients, and model parameters—across GPUs to eliminate memory redundancy. It works by sharding these states so each processor only stores a unique slice, using collective communication operations (like All-Gather and Reduce-Scatter) to reconstruct the full states only when needed for computation. This partitioning enables the training of models that are significantly larger than the memory of any single device.
Related Terms
ZeRO is a cornerstone of large-scale model training. These related concepts are essential for understanding the broader ecosystem of memory optimization and distributed computation.
Mixed Precision Training
Mixed precision training uses lower numerical precision (e.g., BF16 or FP16) for most operations to accelerate computation and reduce memory usage, while maintaining higher precision (FP32) for a master copy of weights and critical operations to ensure numerical stability.
- Key Components:
- FP16/BF16 for Forward/Backward Pass: Faster computation and halved memory for activations and gradients.
- FP32 Master Weights & Optimizer States: Maintains a full-precision copy of weights to avoid underflow/overflow during gradient updates.
- Loss Scaling: Scales the loss to prevent gradient values from vanishing in the lower-precision range.
- Integration with ZeRO: ZeRO's optimizer state partitioning is especially powerful here, as the FP32 master states are a major memory bottleneck. Combining mixed precision with ZeRO-2 or ZeRO-3 is standard for modern large model training.
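Loss scaling can be demonstrated with the standard library alone, since `struct` supports IEEE half precision through the `"e"` format. The values below are illustrative assumptions: a gradient of `1e-8` and a fixed scale of `2**16`. Production frameworks typically adjust the scale dynamically.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8
loss_scale = 65536.0  # 2**16, an assumed fixed scale

unscaled = to_fp16(tiny_grad)             # too small for FP16, becomes 0.0
scaled = to_fp16(tiny_grad * loss_scale)  # representable after scaling
recovered = scaled / loss_scale           # unscale in higher precision

print(unscaled, recovered)
```

Without scaling the gradient is lost entirely. With scaling it survives the FP16 round trip and is recovered to within FP16 rounding error once divided by the scale in higher precision.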
Model Parallelism
Model parallelism is a distributed training strategy that partitions a single model's layers or components across multiple devices. This contrasts with data parallelism (which ZeRO enhances) where the model is replicated and data is split.
- Tensor Parallelism: Splits individual layer operations (e.g., matrix multiplications within an MLP or attention block) across GPUs. Used in frameworks such as Megatron-LM.
- Pipeline Parallelism: Places different groups of model layers on different GPUs. Micro-batching is used to keep the pipeline full and avoid idle devices.
- Relationship to ZeRO: ZeRO is a form of optimized data parallelism. In practice, the largest models use 3D parallelism: a combination of ZeRO/FSDP (Data Parallelism), Tensor Parallelism, and Pipeline Parallelism to distribute both the model and the data.
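When the three forms of parallelism are combined, ZeRO shards state only across the data-parallel group. The sketch below shows how that group size follows from the total GPU count; the function name and the example cluster shape are hypothetical.

```python
def data_parallel_degree(world_size, tensor_parallel, pipeline_parallel):
    """Number of data-parallel replicas left after model parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor * pipeline degree")
    return world_size // model_parallel

# Example: 64 GPUs, 4-way tensor parallelism, 2-way pipeline parallelism
dp = data_parallel_degree(64, tensor_parallel=4, pipeline_parallel=2)
print(dp)  # 8
```

In this example ZeRO would partition optimizer states, gradients, or parameters across 8 replicas, not across all 64 GPUs.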
ZeRO Stages (1, 2, 3)
ZeRO operates in progressive optimization stages, each eliminating a different type of memory redundancy.
- ZeRO-1 (Optimizer State Partitioning): Partitions only the optimizer states (e.g., momentum, variance) across processes. Reduces memory proportional to the data parallelism degree.
- ZeRO-2 (Add Gradient Partitioning): Adds partitioning of gradients in addition to optimizer states. Further reduces memory and enables larger batch sizes.
- ZeRO-3 (Add Parameter Partitioning): Partitions the model parameters themselves across processes. This is the most memory-efficient stage, enabling model sizes that exceed the memory of any single GPU. Parameters are gathered on-demand for computation.
- ZeRO-Offload & ZeRO-Infinity: Advanced variants that offload partitioned states to CPU RAM (Offload) or NVMe storage (Infinity), pushing the boundaries of trainable model size.
Distributed Data Parallel (DDP)
Distributed Data Parallel is the classic, non-sharded data parallelism framework in PyTorch. It serves as the baseline that ZeRO optimizes.
- Baseline Mechanism: Each GPU holds a full replica of the model, its gradients, and the optimizer states, and processes a different shard of the data batch.
  - After the backward pass, gradients are averaged across all processes via an `all-reduce` operation.
  - Each GPU then performs an identical weight update.
- The Redundancy Problem: This full replication is the source of memory redundancy. With mixed-precision Adam, each parameter costs about 2 bytes (FP16 weight) + 2 bytes (FP16 gradient) + 12 bytes (FP32 master weight, momentum, and variance) = 16 bytes, so a 10B-parameter model needs roughly 10B × 16 bytes ≈ 160 GB per GPU for parameters, gradients, and Adam optimizer states alone.
- ZeRO as an Optimization: ZeRO can be viewed as a memory-optimized version of DDP. It removes the replication of optimizer states, gradients, and parameters, transforming DDP from a memory-redundant to a memory-efficient paradigm.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.