A multi-armed bandit is a classic reinforcement learning problem that formalizes the exploration-exploitation tradeoff. It is modeled as an agent repeatedly choosing from multiple actions ("arms"), each providing a reward drawn from an unknown probability distribution. The agent's objective is to maximize its cumulative reward over time by balancing exploration (trying lesser-known arms to gather information) against exploitation (selecting the arm currently estimated to be best).
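The tradeoff can be illustrated with the epsilon-greedy strategy, one common way to balance exploration and exploitation: with probability epsilon the agent explores a random arm, and otherwise it exploits the arm with the highest estimated reward. The following is a minimal sketch of such an agent on a Bernoulli bandit; the function name, parameters, and reward model are illustrative choices, not part of the original text.

```python
import random

def epsilon_greedy_bandit(true_means, n_steps=1000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli bandit.

    true_means: each arm's success probability (unknown to the agent).
    Returns per-arm reward estimates, pull counts, and total reward.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running mean reward per arm
    counts = [0] * n_arms        # how often each arm has been pulled
    total_reward = 0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: pick a uniformly random arm.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Draw a Bernoulli reward from the chosen arm's true distribution.
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward
```

With enough steps, the pull counts concentrate on the truly best arm while the small epsilon keeps every arm's estimate from going stale, which is exactly the balance described above.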
