A Multi-Armed Bandit (MAB) is a classic reinforcement learning problem that formalizes the exploration-exploitation trade-off, where an agent must sequentially choose from a set of actions ("arms") with unknown reward distributions to maximize cumulative gain over time. In multi-agent system orchestration, this models the decision of whether to explore an untested agent's capabilities or exploit a known high-performer when allocating a task.
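The trade-off above can be sketched with a standard bandit strategy such as UCB1, which scores each arm by its average reward plus an exploration bonus that shrinks as the arm is tried more often. The agent success rates, round count, and function names below are illustrative assumptions, not part of the original text:

```python
import math
import random

def ucb1_select(counts, values, total):
    """Pick the arm (agent) with the highest UCB1 score:
    mean reward + sqrt(2 * ln(total) / pulls)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # explore: try every arm at least once
    return max(
        range(len(counts)),
        key=lambda a: values[a] + math.sqrt(2 * math.log(total) / counts[a]),
    )

def run_bandit(success_rates, rounds, seed=0):
    """Simulate task allocation among agents whose true success
    probabilities (success_rates) are unknown to the selector."""
    rng = random.Random(seed)
    k = len(success_rates)
    counts = [0] * k      # times each agent was chosen
    values = [0.0] * k    # running mean reward per agent
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, values, t)
        reward = 1.0 if rng.random() < success_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

# Three hypothetical agents; the selector does not know these rates.
counts, values = run_bandit([0.3, 0.8, 0.5], rounds=2000)
```

After enough rounds, the high-performing agent (true rate 0.8) accumulates most of the task allocations, while the exploration bonus guarantees the others are still sampled occasionally.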
