Mixture of Experts (MoE) is a conditional-computation architecture in which a routing network dynamically selects a sparse subset of specialized "expert" sub-networks to process each input token. This design decouples parameter count from computational cost, enabling models with hundreds of billions or even trillions of parameters while activating only a small fraction of them, such as two experts out of dozens or hundreds, per token per layer. It is a cornerstone technique for large language models (LLMs) such as Mixtral, and reportedly GPT-4, that need vast capacity without a proportional increase in inference cost.
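The routing mechanism described above can be sketched in a few lines. The following is a minimal illustration, not any production implementation: the router, expert weights, and dimensions are all made up for the example, and the experts are plain linear maps rather than real feed-forward blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2  # toy sizes, chosen for illustration

# Hypothetical parameters: a router projection and one tiny linear "expert" each.
router_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(x):
    """Route one token vector x through its top_k highest-scoring experts."""
    logits = x @ router_w                # (n_experts,) routing scores
    chosen = np.argsort(logits)[-top_k:] # indices of the selected experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                 # softmax over the selected experts only
    # Weighted sum of the chosen experts' outputs; unselected experts
    # are never evaluated, which is the source of the compute savings.
    return sum(g * (x @ expert_w[i]) for g, i in zip(gates, chosen))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```

Note that the softmax here is taken over only the selected experts; real systems differ on this detail (e.g. softmax-then-top-k vs. top-k-then-softmax) and typically add an auxiliary load-balancing loss so that tokens do not collapse onto a few experts.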
