Model parallelism is a distributed training strategy that partitions a single neural network's layers, operators, or tensors across multiple hardware devices (e.g., GPUs or TPUs), enabling training of models whose parameters exceed the memory capacity of any one device. Unlike data parallelism, which replicates the entire model on every device, model parallelism splits the model itself: each device computes a distinct segment of the forward and backward passes, communicating activations downstream during the forward pass and gradients upstream during the backward pass.
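The mechanics can be illustrated with a minimal sketch. The two "devices" below are simulated as ordinary Python objects holding NumPy arrays (no real accelerators or communication library are used); each holds one linear stage of a two-stage model, activations are handed forward between stages, and gradients are handed back. All names here (`Linear`, `stage0`, `stage1`) are hypothetical, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """One model stage, imagined to live on its own device."""
    def __init__(self, n_in, n_out):
        self.W = rng.standard_normal((n_in, n_out)) * 0.1

    def forward(self, x):
        self.x = x                          # cache input for the backward pass
        return x @ self.W

    def backward(self, grad_out, lr=0.01):
        grad_in = grad_out @ self.W.T       # gradient "communicated" to the previous stage
        self.W -= lr * (self.x.T @ grad_out)  # local weight update on this device
        return grad_in

# "Device 0" holds stage0, "device 1" holds stage1 — the model is split, not replicated.
stage0, stage1 = Linear(4, 8), Linear(8, 2)

x = rng.standard_normal((16, 4))
y = rng.standard_normal((16, 2))

loss_before = ((stage1.forward(stage0.forward(x)) - y) ** 2).mean()

for _ in range(50):
    h = stage0.forward(x)            # computed on device 0
    out = stage1.forward(h)          # activation sent forward to device 1
    grad = 2 * (out - y) / out.size  # gradient of mean-squared error w.r.t. out
    g_h = stage1.backward(grad)      # gradient sent back to device 0
    stage0.backward(g_h)

loss_after = ((stage1.forward(stage0.forward(x)) - y) ** 2).mean()
```

In a real framework the inter-stage hand-offs would be device-to-device transfers (e.g., over NVLink or a collective-communication library) rather than plain Python returns, but the data flow — activations forward, gradients backward, weights updated locally — is the same.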
