The Zero Redundancy Optimizer (ZeRO) is a suite of memory optimization techniques for distributed deep learning that partitions the model state (optimizer states, gradients, and parameters) across the data-parallel workers. This eliminates the memory redundancy inherent in traditional data parallelism, where every GPU holds a full replica of the entire model state. By sharding these components in three cumulative stages, ZeRO enables the training of models whose state is orders of magnitude larger than the memory of any single accelerator.
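The memory accounting behind this claim can be sketched numerically. The ZeRO paper models per-GPU model-state memory for mixed-precision Adam training as 2Ψ bytes of fp16 parameters, 2Ψ bytes of fp16 gradients, and 12Ψ bytes of optimizer state (fp32 master parameters, momentum, and variance) for a model with Ψ parameters; each ZeRO stage shards one more of these components across the N data-parallel GPUs. The helper below is an illustrative sketch of that arithmetic (the function name and stage numbering convention are ours), not part of any library API:

```python
def zero_memory_per_gpu(psi, n_gpus, stage):
    """Approximate per-GPU model-state memory in bytes for
    mixed-precision Adam, following the ZeRO paper's accounting:
      fp16 parameters:   2 * psi bytes
      fp16 gradients:    2 * psi bytes
      optimizer states: 12 * psi bytes (fp32 params, momentum, variance)
    `stage` selects which of these are sharded across n_gpus."""
    params, grads, opt = 2 * psi, 2 * psi, 12 * psi
    if stage == 0:   # plain data parallelism: full replica on every GPU
        return params + grads + opt
    if stage == 1:   # ZeRO-1: shard optimizer states only
        return params + grads + opt / n_gpus
    if stage == 2:   # ZeRO-2: also shard gradients
        return params + (grads + opt) / n_gpus
    if stage == 3:   # ZeRO-3: also shard the parameters themselves
        return (params + grads + opt) / n_gpus
    raise ValueError("stage must be in 0..3")

# A 7.5B-parameter model on 64 GPUs (the paper's running example):
psi, n = 7.5e9, 64
for s in range(4):
    print(f"stage {s}: {zero_memory_per_gpu(psi, n, s) / 1e9:.1f} GB")
```

Running the sketch reproduces the familiar progression: roughly 120 GB per GPU with no sharding, shrinking to about 31.4 GB, 16.6 GB, and finally 1.9 GB at stage 3, where per-GPU memory scales inversely with the number of devices.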
