Inferensys

Glossary

Multi-Armed Bandit

The multi-armed bandit is a classic reinforcement learning problem that formalizes the exploration-exploitation tradeoff, where an agent must choose between multiple actions with unknown reward distributions to maximize cumulative gain.
Developer reviewing multi-agent chat interface on laptop, agent conversation logs visible, casual coding session at WeWork desk.
REINFORCEMENT LEARNING

What is a Multi-Armed Bandit?

A formal framework for decision-making under uncertainty that models the fundamental tradeoff between exploring new options and exploiting known rewards.

A multi-armed bandit is a classic reinforcement learning problem that formalizes the exploration-exploitation tradeoff. It is modeled as an agent repeatedly choosing from multiple actions ("arms"), each providing a reward drawn from an unknown probability distribution. The agent's objective is to maximize its cumulative reward over time by intelligently balancing trying lesser-known arms to gather information (exploration) and selecting the arm currently estimated to be best (exploitation).

Algorithms like Upper Confidence Bound (UCB) and Thompson Sampling provide principled strategies for this tradeoff. UCB selects arms based on their estimated reward plus an uncertainty bonus, while Thompson Sampling uses a Bayesian approach, sampling rewards from posterior distributions. These frameworks are foundational to A/B testing, clinical trial design, and recommendation systems, where they enable efficient online optimization without the need for exhaustive exploration.

EXPLORATION-EXPLOITATION STRATEGIES

Key Multi-Armed Bandit Algorithms

Multi-armed bandit algorithms provide formal strategies for balancing the exploration of uncertain options with the exploitation of known high-reward actions. These algorithms are foundational to adaptive systems in recommendation engines, clinical trials, and online advertising.

01

Epsilon-Greedy

The epsilon-greedy algorithm is the simplest and most widely used bandit strategy. It operates by selecting the action currently estimated to be the best (exploitation) with probability 1 - ε, and selecting a random action (exploration) with probability ε. The parameter ε (epsilon) is typically a small constant, like 0.1.

  • Pros: Extremely simple to implement and understand.
  • Cons: Exploration is undirected and inefficient; continues random exploration even after the optimal arm is identified.
  • Use Case: Ideal for initial prototyping or environments where computational simplicity is paramount.
02

Upper Confidence Bound (UCB)

The Upper Confidence Bound (UCB) family of algorithms selects actions based on an optimistic estimate of their potential reward. The core principle is optimism in the face of uncertainty. It chooses the arm that maximizes a score: the current sample mean plus a confidence interval that shrinks as the arm is pulled more often.

  • Formula: The canonical UCB1 score for arm i is: mean_i + sqrt(2 * ln(total_pulls) / pulls_i).
  • Pros: Provides a deterministic, theoretically-grounded balance between exploration and exploitation with strong regret bounds.
  • Cons: Requires knowledge of total time horizon for some variants; can be sensitive to the scaling of rewards.
  • Use Case: Online advertising A/B testing, where a provable regret guarantee is desirable.
03

Thompson Sampling

Thompson Sampling is a Bayesian probabilistic algorithm. It maintains a probability distribution (posterior) over the expected reward of each arm. At each step, it samples a reward estimate from each arm's posterior distribution and selects the arm with the highest sampled value. It then updates the posterior based on the observed reward.

  • Mechanism: Naturally balances exploration and exploitation; arms with high uncertainty have wider posteriors, giving them a chance to be sampled.
  • Pros: Often achieves lower empirical regret than UCB; elegantly handles contextual and non-stationary bandits.
  • Cons: Computationally more intensive than epsilon-greedy; requires choosing a prior distribution.
  • Use Case: Personalized recommendation systems and dynamic pricing, where user preferences are modeled probabilistically.
04

EXP3 (Adversarial Bandits)

The Exponential-weight algorithm for Exploration and Exploitation (EXP3) is designed for the adversarial bandit setting, where reward distributions are not stationary but can change arbitrarily over time, potentially in response to the agent's actions.

  • Mechanism: Maintains a weight for each arm and selects an arm with a probability proportional to its exponentially weighted cumulative reward. It mixes in a uniform exploration probability to ensure no arm is completely ignored.
  • Key Feature: Provides strong performance guarantees even in worst-case, non-stochastic environments.
  • Cons: Generally performs worse than stochastic algorithms (like UCB or Thompson) in benign, stationary environments.
  • Use Case: Online routing in volatile networks, or any scenario where reward mechanisms may be actively opposed.
05

LinUCB (Contextual Bandits)

LinUCB extends the UCB principle to contextual bandits, where side information (context) is available at each decision point. It assumes the expected reward of an arm is a linear function of the context vector.

  • Mechanism: For each arm, it learns a linear model. The UCB score becomes the linear prediction plus an uncertainty term derived from the covariance of the observed contexts for that arm.
  • Pros: Enables personalization by incorporating user/context features into the decision.
  • Cons: Assumes a linear relationship between context and reward; computational cost scales with the number of arms and context dimension.
  • Use Case: News article recommendation, where the context includes user demographics and article topics.
06

Gittins Index (Bayesian Optimal)

The Gittins Index provides the theoretically optimal solution for the Bayesian multi-armed bandit problem with a geometric discounting of future rewards. It assigns a single index value to each arm based on its current posterior distribution and the discount factor.

  • Optimality: The policy of always playing the arm with the highest Gittins Index is optimal for the infinite-horizon discounted reward criterion.
  • Challenge: Computing the exact Gittins Index is complex and often requires pre-computed tables or approximations.
  • Use Case: Foundational in economics and optimal stopping theory; used in scenarios like research project selection where a precise, long-term optimal policy is required, despite computational cost.
MULTI-ARMED BANDIT

Frequently Asked Questions

The multi-armed bandit (MAB) problem is a foundational framework in reinforcement learning and decision theory that formalizes the exploration-exploitation tradeoff. These questions address its core mechanisms, algorithms, and applications in AI systems.

The Multi-Armed Bandit (MAB) problem is a sequential decision-making framework where an agent repeatedly chooses from a set of actions (the 'bandits' or 'arms'), each providing a stochastic reward drawn from an unknown probability distribution, with the goal of maximizing cumulative reward over time. It is the canonical model for the exploration-exploitation tradeoff, forcing the agent to balance trying new arms to gather information (exploration) and selecting the arm currently believed to be best (exploitation). The name is derived from a gambler facing a row of slot machines ('one-armed bandits'), each with an unknown payout rate.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.