A multi-armed bandit is a classic reinforcement learning problem that formalizes the exploration-exploitation tradeoff. It is modeled as an agent repeatedly choosing from multiple actions ("arms"), each providing a reward drawn from an unknown probability distribution. The agent's objective is to maximize its cumulative reward over time by balancing exploration (trying lesser-known arms to gather information) against exploitation (selecting the arm currently estimated to be best).
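The tradeoff can be illustrated with the epsilon-greedy strategy, one common way to balance exploration and exploitation: with probability epsilon the agent explores a random arm, and otherwise it exploits the arm with the highest estimated reward. The following is a minimal sketch of such an agent on a Bernoulli bandit; the function name, parameters, and reward model are illustrative choices, not part of the original text.

```python
import random

def epsilon_greedy_bandit(true_means, n_steps=1000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli bandit.

    true_means: each arm's success probability (unknown to the agent).
    Returns per-arm reward estimates, pull counts, and total reward.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running mean reward per arm
    counts = [0] * n_arms        # how often each arm has been pulled
    total_reward = 0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: pick a uniformly random arm.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Draw a Bernoulli reward from the chosen arm's true distribution.
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward
```

With enough steps, the pull counts concentrate on the truly best arm while the small epsilon keeps every arm's estimate from going stale, which is exactly the balance described above.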
