A multi-armed bandit is a classic reinforcement learning problem that formalizes the exploration-exploitation tradeoff. It is modeled as an agent repeatedly choosing from multiple actions ("arms"), each providing a reward drawn from an unknown probability distribution. The agent's objective is to maximize its cumulative reward over time by intelligently balancing trying lesser-known arms to gather information (exploration) and selecting the arm currently estimated to be best (exploitation).
Glossary
Multi-Armed Bandit

What is a Multi-Armed Bandit?
A formal framework for decision-making under uncertainty that models the fundamental tradeoff between exploring new options and exploiting known rewards.
Algorithms like Upper Confidence Bound (UCB) and Thompson Sampling provide principled strategies for this tradeoff. UCB selects arms based on their estimated reward plus an uncertainty bonus, while Thompson Sampling uses a Bayesian approach, sampling rewards from posterior distributions. These frameworks are foundational to A/B testing, clinical trial design, and recommendation systems, where they enable efficient online optimization without the need for exhaustive exploration.
Key Multi-Armed Bandit Algorithms
Multi-armed bandit algorithms provide formal strategies for balancing the exploration of uncertain options with the exploitation of known high-reward actions. These algorithms are foundational to adaptive systems in recommendation engines, clinical trials, and online advertising.
Epsilon-Greedy
The epsilon-greedy algorithm is the simplest and most widely used bandit strategy. It operates by selecting the action currently estimated to be the best (exploitation) with probability 1 - ε, and selecting a random action (exploration) with probability ε. The parameter ε (epsilon) is typically a small constant, like 0.1.
- Pros: Extremely simple to implement and understand.
- Cons: Exploration is undirected and inefficient; continues random exploration even after the optimal arm is identified.
- Use Case: Ideal for initial prototyping or environments where computational simplicity is paramount.
Upper Confidence Bound (UCB)
The Upper Confidence Bound (UCB) family of algorithms selects actions based on an optimistic estimate of their potential reward. The core principle is optimism in the face of uncertainty. It chooses the arm that maximizes a score: the current sample mean plus a confidence interval that shrinks as the arm is pulled more often.
- Formula: The canonical UCB1 score for arm i is:
mean_i + sqrt(2 * ln(total_pulls) / pulls_i). - Pros: Provides a deterministic, theoretically-grounded balance between exploration and exploitation with strong regret bounds.
- Cons: Requires knowledge of total time horizon for some variants; can be sensitive to the scaling of rewards.
- Use Case: Online advertising A/B testing, where a provable regret guarantee is desirable.
Thompson Sampling
Thompson Sampling is a Bayesian probabilistic algorithm. It maintains a probability distribution (posterior) over the expected reward of each arm. At each step, it samples a reward estimate from each arm's posterior distribution and selects the arm with the highest sampled value. It then updates the posterior based on the observed reward.
- Mechanism: Naturally balances exploration and exploitation; arms with high uncertainty have wider posteriors, giving them a chance to be sampled.
- Pros: Often achieves lower empirical regret than UCB; elegantly handles contextual and non-stationary bandits.
- Cons: Computationally more intensive than epsilon-greedy; requires choosing a prior distribution.
- Use Case: Personalized recommendation systems and dynamic pricing, where user preferences are modeled probabilistically.
EXP3 (Adversarial Bandits)
The Exponential-weight algorithm for Exploration and Exploitation (EXP3) is designed for the adversarial bandit setting, where reward distributions are not stationary but can change arbitrarily over time, potentially in response to the agent's actions.
- Mechanism: Maintains a weight for each arm and selects an arm with a probability proportional to its exponentially weighted cumulative reward. It mixes in a uniform exploration probability to ensure no arm is completely ignored.
- Key Feature: Provides strong performance guarantees even in worst-case, non-stochastic environments.
- Cons: Generally performs worse than stochastic algorithms (like UCB or Thompson) in benign, stationary environments.
- Use Case: Online routing in volatile networks, or any scenario where reward mechanisms may be actively opposed.
LinUCB (Contextual Bandits)
LinUCB extends the UCB principle to contextual bandits, where side information (context) is available at each decision point. It assumes the expected reward of an arm is a linear function of the context vector.
- Mechanism: For each arm, it learns a linear model. The UCB score becomes the linear prediction plus an uncertainty term derived from the covariance of the observed contexts for that arm.
- Pros: Enables personalization by incorporating user/context features into the decision.
- Cons: Assumes a linear relationship between context and reward; computational cost scales with the number of arms and context dimension.
- Use Case: News article recommendation, where the context includes user demographics and article topics.
Gittins Index (Bayesian Optimal)
The Gittins Index provides the theoretically optimal solution for the Bayesian multi-armed bandit problem with a geometric discounting of future rewards. It assigns a single index value to each arm based on its current posterior distribution and the discount factor.
- Optimality: The policy of always playing the arm with the highest Gittins Index is optimal for the infinite-horizon discounted reward criterion.
- Challenge: Computing the exact Gittins Index is complex and often requires pre-computed tables or approximations.
- Use Case: Foundational in economics and optimal stopping theory; used in scenarios like research project selection where a precise, long-term optimal policy is required, despite computational cost.
Frequently Asked Questions
The multi-armed bandit (MAB) problem is a foundational framework in reinforcement learning and decision theory that formalizes the exploration-exploitation tradeoff. These questions address its core mechanisms, algorithms, and applications in AI systems.
The Multi-Armed Bandit (MAB) problem is a sequential decision-making framework where an agent repeatedly chooses from a set of actions (the 'bandits' or 'arms'), each providing a stochastic reward drawn from an unknown probability distribution, with the goal of maximizing cumulative reward over time. It is the canonical model for the exploration-exploitation tradeoff, forcing the agent to balance trying new arms to gather information (exploration) and selecting the arm currently believed to be best (exploitation). The name is derived from a gambler facing a row of slot machines ('one-armed bandits'), each with an unknown payout rate.
Enabling Efficiency, Speed & Accuracy
Intelligent Analysis, Decision & Execution
We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.
Talk to Us
Search across company data
Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.
Useful when people spend too long searching or get different answers from different systems.

Automate internal workflows
Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.
Useful when repetitive work moves across multiple tools and teams.

Add AI to products and internal tools
Build assistants, guided actions, or decision support into the software your team or customers already use.
Useful when AI needs to be part of the product, not a separate tool.
Related Terms
The Multi-Armed Bandit problem is a cornerstone of sequential decision-making under uncertainty. These related concepts formalize the trade-offs between gathering new information and leveraging current knowledge.
Exploration-Exploitation Tradeoff
The fundamental dilemma in sequential decision-making where an agent must choose between exploring new actions to gather information about their rewards and exploiting known actions with the highest estimated reward. This trade-off is central to reinforcement learning, clinical trials, and recommendation systems.
- Contextual Bandits extend this by incorporating state information.
- Regret is the primary metric, measuring the cumulative loss from not always choosing the optimal action.
Contextual Bandits
An extension of the classic multi-armed bandit where each arm's reward distribution depends on contextual information (or features) available at each decision point. The agent learns a policy that maps contexts to actions.
- Used in personalized recommendations (e.g., selecting news articles or ads based on user profile).
- Algorithms like LinUCB and Thompson Sampling with linear models are standard solutions.
- It bridges the gap between simple bandits and full reinforcement learning by adding state but not considering long-term consequences of actions.
Thompson Sampling
A Bayesian probability matching algorithm for solving the bandit problem. It maintains a posterior distribution over the reward of each arm and selects an arm by sampling from these posteriors and choosing the arm with the highest sampled value.
- Naturally balances exploration and exploitation.
- Theoretically proven to achieve near-optimal Bayesian regret.
- Widely used in practice due to its simplicity and empirical performance, often outperforming UCB-based approaches.
Upper Confidence Bound (UCB)
A family of deterministic algorithms that select the arm with the highest Upper Confidence Bound on its expected reward. The bound is the sample mean plus a confidence interval term that shrinks as the arm is pulled more often.
- UCB1 is a canonical algorithm with strong theoretical guarantees on cumulative regret.
- The confidence term explicitly quantifies the uncertainty, forcing exploration of less-sampled arms.
- Contrasts with Thompson Sampling's probabilistic approach, offering more interpretable selection criteria.
Regret Minimization
The primary objective in bandit and online learning problems. Regret is defined as the difference between the cumulative reward of the optimal policy (in hindsight) and the cumulative reward of the learner's policy.
- Cumulative Regret grows sub-linearly with time for efficient algorithms.
- Bayesian Regret considers expectation over a prior distribution of problem instances.
- The goal is to design algorithms with provably low regret bounds, ensuring performance converges to optimal.
Adversarial Bandits
A variant of the bandit problem where the rewards for each arm are not generated by a stationary stochastic process but can be chosen by an adversary in each round. This models non-stationary or worst-case environments.
- Algorithms like EXP3 (Exponential-weight algorithm for Exploration and Exploitation) are used.
- The focus shifts from probabilistic regret to regret against any sequence of rewards.
- Applicable to problems like routing in changing networks or playing repeated games against an opponent.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
Partnered with leading AI, data, and software stack.
How We Work
Custom AI workflows for your Business
One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.
01
Review the use case
We understand the task, the users, and where AI can actually help.
Read more02
Pick the right approach
We define what needs search, automation, or product integration.
Read more03
Build the first useful version
We implement the part that proves the value first.
Read more04
Improve from there
We add the checks and visibility needed to keep it useful.
Read moreThe first call is a practical review of your use case and the right next step.
Talk to Us