A Mixture of Experts (MoE) is a neural network architecture designed for conditional computation, where different specialized subnetworks, or 'experts,' are activated for different inputs. A trainable gating network analyzes each input and produces a sparse set of weights, routing the data to only a few relevant experts. This allows the total model capacity to be massive—often hundreds of billions of parameters—while keeping the computational cost per input low, as only a small subset of parameters is used during inference.
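The routing described above can be sketched in a few lines. This is a minimal, illustrative NumPy implementation, not any particular library's API: the sizes, the linear experts, and the top-k gate are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- chosen for illustration only.
D_IN, D_OUT = 8, 4        # input / output dimensions
N_EXPERTS, TOP_K = 4, 2   # total experts, experts activated per input

# Each expert is a simple linear map; the gating network is another linear map.
expert_weights = rng.normal(size=(N_EXPERTS, D_IN, D_OUT))
gate_weights = rng.normal(size=(D_IN, N_EXPERTS))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x):
    """Route input x to the TOP_K highest-scoring experts only."""
    logits = x @ gate_weights              # gating network: one score per expert
    chosen = np.argsort(logits)[-TOP_K:]   # indices of the top-k experts
    weights = softmax(logits[chosen])      # sparse weights over chosen experts
    # Only TOP_K of the N_EXPERTS expert computations run: conditional
    # computation keeps cost low even if total parameter count is large.
    out = sum(w * (x @ expert_weights[i]) for w, i in zip(weights, chosen))
    return out, chosen

x = rng.normal(size=D_IN)
y, chosen = moe_forward(x)
print(y.shape, len(chosen))  # (4,) 2
```

Note that the model holds `N_EXPERTS` full expert weight matrices, but each input touches only `TOP_K` of them, which is exactly the capacity-versus-compute trade-off the paragraph describes.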
