Model parallelism is a distributed training strategy that partitions a single neural network's layers, operators, or tensors across multiple hardware devices (e.g., GPUs or TPUs), enabling training of models whose parameters exceed the memory capacity of any one device. Unlike data parallelism, which replicates the entire model on every device, model parallelism splits the model itself: each device computes a distinct segment of the forward and backward passes, communicating activations downstream during the forward pass and gradients upstream during the backward pass.
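The mechanics can be illustrated with a minimal sketch. The two "devices" below are simulated as ordinary Python objects holding NumPy arrays (no real accelerators or communication library are used); each holds one linear stage of a two-stage model, activations are handed forward between stages, and gradients are handed back. All names here (`Linear`, `stage0`, `stage1`) are hypothetical, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """One model stage, imagined to live on its own device."""
    def __init__(self, n_in, n_out):
        self.W = rng.standard_normal((n_in, n_out)) * 0.1

    def forward(self, x):
        self.x = x                          # cache input for the backward pass
        return x @ self.W

    def backward(self, grad_out, lr=0.01):
        grad_in = grad_out @ self.W.T       # gradient "communicated" to the previous stage
        self.W -= lr * (self.x.T @ grad_out)  # local weight update on this device
        return grad_in

# "Device 0" holds stage0, "device 1" holds stage1 — the model is split, not replicated.
stage0, stage1 = Linear(4, 8), Linear(8, 2)

x = rng.standard_normal((16, 4))
y = rng.standard_normal((16, 2))

loss_before = ((stage1.forward(stage0.forward(x)) - y) ** 2).mean()

for _ in range(50):
    h = stage0.forward(x)            # computed on device 0
    out = stage1.forward(h)          # activation sent forward to device 1
    grad = 2 * (out - y) / out.size  # gradient of mean-squared error w.r.t. out
    g_h = stage1.backward(grad)      # gradient sent back to device 0
    stage0.backward(g_h)

loss_after = ((stage1.forward(stage0.forward(x)) - y) ** 2).mean()
```

In a real framework the inter-stage hand-offs would be device-to-device transfers (e.g., over NVLink or a collective-communication library) rather than plain Python returns, but the data flow — activations forward, gradients backward, weights updated locally — is the same.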
