Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation dilemma in the multi-armed bandit problem. An agent selects an action by first sampling a potential reward value from the posterior probability distribution maintained for each available arm (or action) and then choosing the arm that yielded the highest sampled value. This stochastic mechanism naturally balances trying uncertain options (exploration) and favoring options believed to be best (exploitation).
Glossary
Thompson Sampling

What is Thompson Sampling?
Thompson Sampling is a foundational algorithm for balancing exploration and exploitation in sequential decision-making problems, particularly the multi-armed bandit.
The algorithm begins with a prior distribution (e.g., Beta for binary rewards, Gaussian for continuous) for each arm's unknown reward rate. After each action and observed reward, it performs a Bayesian update to refine the corresponding posterior. This approach is probabilistically optimal and widely applied in contextual bandits, recommendation systems, and hyperparameter optimization, where it efficiently navigates large search spaces. Its connection to recursive self-improvement lies in its use as a meta-algorithm for guiding an agent's own learning and adaptation processes.
Key Characteristics of Thompson Sampling
Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation trade-off by selecting actions based on samples drawn from posterior probability distributions.
Bayesian Probability Matching
The core mechanism of Thompson Sampling is probability matching. Instead of always choosing the arm with the highest estimated mean reward (pure exploitation), it selects each arm with a probability equal to that arm being the optimal one, given the current uncertainty. This is achieved by:
- Maintaining a posterior distribution (e.g., Beta for Bernoulli rewards, Gaussian for normal rewards) for the expected reward of each arm.
- On each trial, sampling a single value from each arm's posterior distribution.
- Choosing the arm whose sampled value is the highest. This stochastic selection naturally balances exploration (sampling uncertain arms) and exploitation (sampling promising arms).
Automatic Uncertainty Quantification
Thompson Sampling inherently quantifies and responds to uncertainty through its Bayesian framework. The variance of the posterior distribution for each arm directly represents the algorithm's uncertainty about that arm's true reward rate.
- High Variance (Early Stages): For an arm with few pulls, the posterior is wide. Random samples from this distribution can vary widely, giving it a significant chance of being selected over a known-good arm, promoting exploration.
- Low Variance (Later Stages): As an arm is pulled more, its posterior distribution tightens around the empirical mean. Its sampled values become consistent, and it is only selected if its mean is truly high, leading to exploitation. This dynamic adjustment requires no external tuning of an exploration parameter (like ε in ε-greedy).
Asymptotic Optimality & Regret Bounds
Thompson Sampling provides strong theoretical performance guarantees. For the standard K-armed bandit problem, it achieves logarithmic expected cumulative regret under certain conditions, matching the asymptotic lower bound and making it asymptotically optimal.
- Cumulative Regret: The total difference in reward between always pulling the optimal arm and the algorithm's choices. Thompson Sampling's regret grows as O(log T).
- Frequentist Analysis: Modern analyses have proven these bounds without relying on a 'correct' Bayesian prior, showing robustness.
- Contextual Bandits: Extensions to linear contextual bandits also achieve near-optimal regret bounds (O(√T) or O(log T) depending on assumptions), making it applicable to personalized recommendations and dynamic treatment assignment.
Computational & Implementation Simplicity
Despite its Bayesian foundation, Thompson Sampling is often computationally straightforward to implement and scale.
- Conjugate Priors: Using conjugate prior-likelihood pairs (e.g., Beta-Bernoulli, Gaussian-Gaussian) allows the posterior to be updated with simple arithmetic upon observing a reward, requiring only O(1) computation per update.
- Parallelization: Sampling and selection for each arm is independent, making it easy to parallelize across a large set of arms.
- Comparison to UCB: Unlike Upper Confidence Bound (UCB) algorithms, which require calculating confidence intervals, Thompson Sampling only requires drawing random samples, which can be more efficient in complex models (e.g., deep neural network-based bandits where sampling from a posterior approximation is simpler than computing high-dimensional confidence sets).
Natural Extension to Complex Models
The sampling-based approach generalizes elegantly beyond simple bandits to complex, real-world problems.
- Contextual Bandits with Deep Learning (Neural Thompson Sampling): The reward function is modeled by a deep neural network. A posterior over network weights is approximated (e.g., via bootstrapping, dropout as approximate Bayesian inference, or Laplace approximation). An action is chosen by sampling a network from this posterior and selecting the context-action pair with the highest predicted reward.
- Non-Stationary Environments: Can adapt to changing reward distributions by using forgetting mechanisms like discounting old observations or using change-point models within the Bayesian update.
- Hierarchical Models: Naturally incorporates shared structure between arms (e.g., in multi-task or meta-learning bandit settings) through hierarchical priors, allowing for efficient knowledge transfer.
Relationship to Other Bandit Algorithms
Thompson Sampling occupies a distinct point in the design space of bandit algorithms.
- vs. ε-Greedy: ε-greedy explores randomly with a fixed probability ε, ignoring uncertainty. Thompson Sampling explores intelligently, proportionally to the probability an arm is optimal.
- vs. Upper Confidence Bound (UCB): UCB is a deterministic, optimism-in-the-face-of-uncertainty principle. It selects the arm with the highest upper confidence bound. Thompson Sampling is randomized and directly samples from the posterior. The two are often viewed as dual perspectives (frequentist optimism vs. Bayesian probability matching) and can have similar regret bounds.
- vs. Gittins Index: The Gittins Index provides an optimal solution for the discounted infinite-horizon Bayesian bandit problem but is often computationally intractable. Thompson Sampling is a simple, near-optimal heuristic for both discounted and finite-horizon settings.
Thompson Sampling vs. Other Bandit Algorithms
A comparison of heuristic strategies for solving the multi-armed bandit problem, focusing on their approach to balancing the exploration of uncertain options with the exploitation of known rewards.
| Feature / Mechanism | Thompson Sampling (Bayesian Bandit) | ε-Greedy | Upper Confidence Bound (UCB) | Gradient Bandit Algorithm |
|---|---|---|---|---|
Core Selection Principle | Sample from posterior, choose max sample | Choose random arm with probability ε, else choose best empirical mean | Choose arm maximizing empirical mean + confidence bound | Choose arm with probability from a softmax of preference scores |
Underlying Framework | Bayesian probability (explicit posterior distributions) | Frequentist statistics (empirical averages) | Frequentist statistics with concentration inequalities | Stochastic gradient ascent on expected reward |
Exploration Strategy | Probability matching via posterior sampling | Undirected, random exploration | Optimistic exploration of plausible best arms | Directed exploration via preference updates |
Handles Delayed Feedback | ||||
Incorporates Prior Knowledge | ||||
Theoretical Regret Bounds | Asymptotically optimal for Bernoulli rewards | Linear regret if ε is fixed | Logarithmic problem-dependent regret | Sublinear regret with appropriate step-size |
Computational Overhead | Medium (requires maintaining & sampling from posteriors) | Low (requires only mean estimates) | Low (requires mean estimates and visit counts) | Low (requires preference scores and a baseline) |
Natural for Non-Stationary Environments |
Frequently Asked Questions
Thompson Sampling is a foundational algorithm for balancing exploration and exploitation, crucial for autonomous systems that must learn optimal actions through trial and error. These FAQs address its core mechanisms, applications, and relationship to broader concepts in agentic and recursive systems.
Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation dilemma in sequential decision-making problems, most classically the multi-armed bandit. It works by maintaining a posterior probability distribution over the expected reward for each possible action (or 'arm'). To select an action, the algorithm samples a single reward value from each action's posterior distribution and then executes the action with the highest sampled value. After observing the actual reward, it updates the posterior distribution for that action using Bayes' rule, refining its beliefs. This simple probability matching strategy naturally balances trying uncertain actions (exploration) and favoring actions believed to be best (exploitation).
Enabling Efficiency, Speed & Accuracy
Intelligent Analysis, Decision & Execution
We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.
Talk to Us
Search across company data
Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.
Useful when people spend too long searching or get different answers from different systems.

Automate internal workflows
Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.
Useful when repetitive work moves across multiple tools and teams.

Add AI to products and internal tools
Build assistants, guided actions, or decision support into the software your team or customers already use.
Useful when AI needs to be part of the product, not a separate tool.
Related Terms
Thompson Sampling is a core technique for balancing exploration and exploitation in sequential decision-making. The following terms are foundational to understanding its context, alternatives, and extensions.
Multi-Armed Bandit Problem
The foundational sequential decision-making framework where an agent repeatedly chooses from a set of actions (arms), each providing a stochastic reward from an unknown distribution. The core tension is between exploitation (choosing the arm with the highest estimated reward) and exploration (trying other arms to gather more information). Thompson Sampling is a heuristic solution to this problem.
Bayesian Optimization
A global optimization strategy for expensive black-box functions. Like Thompson Sampling, it uses a probabilistic surrogate model (e.g., a Gaussian Process) to model the objective. It selects the next point to evaluate by optimizing an acquisition function, which explicitly balances exploration and exploitation. Common functions include Expected Improvement (EI) and Upper Confidence Bound (UCB).
Upper Confidence Bound (UCB)
A deterministic alternative to Thompson Sampling for the multi-armed bandit problem. The algorithm selects the arm with the highest upper confidence bound, calculated as the estimated mean reward plus an exploration bonus. The bonus (e.g., √(2 log t / n_i)) shrinks as an arm is pulled more often (n_i), ensuring all arms are sufficiently explored. It provides strong theoretical regret bounds.
Contextual Bandits
An extension of the classic bandit problem where, before each decision, the agent observes a context vector (e.g., user features, time of day). The goal is to learn a policy mapping contexts to actions to maximize reward. Thompson Sampling is readily extended to this setting using contextual models like Bayesian linear regression or neural networks with approximate posterior distributions.
Reinforcement Learning (RL)
A broader framework for sequential decision-making in environments with delayed rewards and state transitions. The multi-armed bandit is a stateless, single-step RL problem. Thompson Sampling's principle of posterior sampling extends to RL as Posterior Sampling for Reinforcement Learning (PSRL), where the agent samples a full Markov Decision Process (MDP) from a posterior and acts optimally within it for an episode.
Monte Carlo Methods
A class of computational algorithms that rely on repeated random sampling to obtain numerical results. Thompson Sampling is inherently a Monte Carlo method: it makes decisions by sampling from posterior distributions. This connects it to broader techniques like Monte Carlo Tree Search (MCTS), which uses random simulations to evaluate actions in complex decision trees.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
Partnered with leading AI, data, and software stack.
How We Work
Custom AI workflows for your Business
One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.
01
Review the use case
We understand the task, the users, and where AI can actually help.
Read more02
Pick the right approach
We define what needs search, automation, or product integration.
Read more03
Build the first useful version
We implement the part that proves the value first.
Read more04
Improve from there
We add the checks and visibility needed to keep it useful.
Read moreThe first call is a practical review of your use case and the right next step.
Talk to Us