Thompson sampling is a Bayesian algorithm for solving the multi-armed bandit problem, a sequential decision-making framework that balances exploration of uncertain options with exploitation of the best-known option. Unlike traditional A/B testing, which uses fixed traffic splits, it dynamically allocates traffic by sampling from a posterior probability distribution for each variant's reward and selecting the variant with the highest sampled value, naturally favoring more promising options over time.
Glossary
Thompson Sampling

What is Thompson Sampling?
Thompson sampling is a Bayesian algorithm for solving the multi-armed bandit problem that selects an action by sampling from the posterior probability distribution of each variant's reward and choosing the one with the highest sampled value.
The algorithm begins with a prior belief (e.g., a Beta distribution) about each variant's performance. After each user interaction, it observes the outcome (e.g., a click or conversion) and updates the corresponding variant's posterior distribution using Bayesian inference. This continuous update allows it to adapt allocation in real-time, making it highly efficient for optimizing metrics like click-through rate. It inherently addresses the explore-exploit trade-off and mitigates the peeking problem associated with frequentist sequential testing.
Key Features of Thompson Sampling
Thompson Sampling is a probability-matching algorithm for the multi-armed bandit problem that balances exploration and exploitation by sampling from posterior distributions. Its core features stem from its Bayesian foundation.
Bayesian Probability Matching
At its core, Thompson Sampling is a probability matching algorithm. It selects an action with a probability equal to that action being the optimal one, given the current uncertainty. This is achieved by:
- Maintaining a posterior probability distribution for the expected reward of each variant (arm).
- On each decision, sampling a single value from each variant's posterior.
- Choosing the variant with the highest sampled value. This elegantly encodes uncertainty—a variant with a wide posterior (high uncertainty) has a higher chance of being sampled as the best, even if its current mean estimate is low.
Automatic Exploration-Exploitation Trade-off
The algorithm dynamically balances exploration (trying uncertain options) and exploitation (choosing the best-known option) without requiring an explicit tuning parameter like the epsilon in ε-greedy methods. The trade-off is inherent:
- Exploitation: A variant with a high mean and low variance will consistently yield high samples.
- Exploration: A variant with high variance (uncertainty) will occasionally produce a high sample, prompting a trial. As data accumulates, posterior distributions sharpen, and exploration naturally diminishes, converging to the optimal variant.
Contextual Bandit Extension
A powerful extension is the Contextual Thompson Sampling, which incorporates feature vectors (context) about each decision opportunity. For each arm, it models the reward as a function of the context, typically using a Bayesian linear model.
- The algorithm maintains a posterior over the model's weight vectors.
- For a given context, it samples a weight vector from the posterior for each arm, predicts a reward, and selects the arm with the highest sampled prediction. This allows for personalized decisions, making it highly effective in recommendation systems and dynamic treatment assignment.
Natural Handling of Delayed Feedback
Thompson Sampling integrates naturally with delayed or batched feedback, a common challenge in online systems. The Bayesian update mechanism is agnostic to the timing of data arrival.
- Observations (rewards) can be incorporated into the posterior distributions as they arrive, regardless of delay.
- The algorithm continues to make decisions based on the most recent posterior state. This contrasts with some frequentist bandit algorithms that may require complex adjustments for delayed rewards.
Computational Efficiency with Conjugate Priors
For common reward models, Thompson Sampling can be extremely computationally efficient through the use of conjugate priors.
- Binary rewards (click/no-click): Use a Beta distribution prior, which updates to a Beta posterior with simple count updates.
- Gaussian rewards: Use a Normal-Gamma or Normal-Inverse-Gamma prior for conjugate updates of the mean and variance. This allows for constant-time updates per observation, making the algorithm scalable for real-time systems with thousands of arms.
Asymptotic Optimality and Regret Bounds
Thompson Sampling provides strong theoretical performance guarantees. It is asymptotically optimal, meaning its cumulative regret grows at the best possible rate as the number of trials increases.
- Cumulative Regret: The total loss incurred from not always playing the optimal arm. For Bernoulli bandits, Thompson Sampling achieves regret that is O(log T).
- These bounds are derived from its probability-matching property, which ensures it does not under-explore the optimal arm. The algorithm's empirical performance often matches or exceeds that of Upper Confidence Bound (UCB) methods.
Thompson Sampling vs. Other Bandit Algorithms
A feature and operational comparison of Thompson Sampling against other prominent multi-armed bandit algorithms, highlighting differences in statistical approach, exploration strategy, and practical implementation.
| Feature / Metric | Thompson Sampling (Bayesian) | Epsilon-Greedy | Upper Confidence Bound (UCB) |
|---|---|---|---|
Core Statistical Foundation | Bayesian inference; samples from posterior reward distributions | Frequentist; uses empirical average rewards | Frequentist; uses confidence bounds derived from concentration inequalities |
Primary Exploration Mechanism | Probability matching via posterior sampling | Random exploration with fixed probability (ε) | Optimistic exploration of upper confidence bounds |
Adapts Exploration Rate Automatically | |||
Explicitly Models Reward Uncertainty | |||
Handles Delayed or Batched Feedback | |||
Computational Overhead per Decision | Medium (requires sampling from posteriors) | Low (requires maintaining averages) | Low (requires calculating confidence bounds) |
Typical Regret Performance | Asymptotically optimal; often superior empirically | Suboptimal; linear regret if ε is constant | Asymptotically optimal for specific bound formulations |
Requires Tuning of Hyperparameters | Prior parameters (often uninformative) | Exploration rate (ε) | Confidence bound parameter (α or c) |
Natural Incorporation of Prior Knowledge | |||
Common Use Case | Contextual bandits, recommendation systems, clinical trials | Simple A/B testing with basic exploration | Stochastic bandits with well-defined reward distributions |
Real-World Applications
Thompson Sampling's Bayesian approach to balancing exploration and exploitation makes it a powerful algorithm for optimizing decisions in dynamic, uncertain environments. Its applications span from digital interfaces to complex physical systems.
Website & UI Optimization
Thompson Sampling is a cornerstone of modern conversion rate optimization and personalization engines. It dynamically tests variations of web page elements (like headlines, button colors, or layouts) to maximize user engagement metrics.
- Key Use Cases: Optimizing checkout flows, article recommendations, and ad placements.
- Advantage over A/B Testing: Continuously reallocates traffic to better-performing variants, minimizing opportunity cost during experiments.
- Real Example: Major e-commerce platforms use it to personalize homepages, showing different product assortments to maximize session value.
Clinical Trial Adaptive Design
In adaptive clinical trials, Thompson Sampling helps allocate patients to different treatment arms (e.g., drug dosage levels or novel therapies) to ethically maximize positive outcomes during the study.
- Ethical Imperative: It inherently favors treatments showing early promise, potentially benefiting more participants.
- Operational Efficiency: Can reduce the required sample size and trial duration compared to fixed randomization.
- Consideration: Requires careful regulatory compliance and simulation to manage operational complexities.
Recommendation & Content Systems
Used to manage the explore-exploit trade-off in recommender systems. It selects which items (movies, products, news articles) to suggest from a large, changing catalog to maximize long-term user satisfaction.
- Cold-Start Problem: Efficiently introduces new items with uncertain appeal into the recommendation pool.
- Dynamic Catalog: Adapts to changing item popularity and user taste drift over time.
- Architecture: Often implemented as a contextual bandit, where the algorithm uses user features (context) to personalize the sampling decision.
Autonomous Systems & Robotics
Thompson Sampling guides robots and autonomous agents in making sequential decisions under uncertainty, such as path planning or resource gathering.
- Robotic Exploration: A robot can use it to decide which area of an unknown environment to explore next, balancing the reward of potential discoveries with the cost of movement.
- Multi-Agent Coordination: In fleet orchestration, it can help allocate tasks (like delivery routes) to different robots based on their dynamically estimated performance and battery levels.
- Real-World Link: Research platforms like Google's Robotics explore these applications. (Reference: https://ai.google/research/teams/brain/robotics/)
Financial Portfolio Optimization
Applied as a Bayesian portfolio strategy to dynamically allocate capital across a set of assets (the 'arms') with uncertain, non-stationary returns.
- Sequential Decision-Making: Treats each rebalancing period as a pull of a bandit arm, where the reward is the asset's return.
- Handles Uncertainty: Naturally incorporates prior beliefs about asset performance and updates them with observed market data.
- Risk Management: The sampling mechanism inherently explores to avoid over-committing to assets based on potentially spurious short-term trends.
Industrial Process & Pricing Optimization
Optimizes parameters in complex systems where the reward function is noisy or expensive to evaluate.
- Smart Manufacturing: Tuning machine settings (e.g., temperature, pressure) to maximize yield or quality, where each experiment is costly.
- Dynamic Pricing: Testing different price points for products or services in real-time to maximize revenue, while learning price elasticity.
- Key Benefit: Its sample-based action selection is robust to reward noise and non-Gaussian distributions common in industrial data.
Frequently Asked Questions
Thompson sampling is a foundational Bayesian algorithm for solving the multi-armed bandit problem, balancing exploration and exploitation in online decision-making. These FAQs address its core mechanisms, applications, and practical considerations for CTOs and product managers implementing A/B testing frameworks.
Thompson sampling is a Bayesian algorithm for solving the multi-armed bandit problem that selects an action by sampling a potential reward value from the posterior probability distribution of each variant and choosing the variant with the highest sampled value. It works through a continuous loop: 1) For each variant (e.g., a webpage layout or model configuration), it maintains a posterior distribution over its expected reward rate (like click-through rate). 2) On each user visit, it draws one random sample from each variant's posterior. 3) It serves the variant with the highest sampled value. 4) It observes the binary outcome (click/no-click, conversion/no-conversion) and updates that variant's posterior distribution using Bayes' theorem, typically with a Beta-Binomial conjugate prior for binary rewards. This process inherently balances exploration (trying variants with uncertain, but potentially high, rewards) and exploitation (using the variant currently believed to be best).
Enabling Efficiency, Speed & Accuracy
Intelligent Analysis, Decision & Execution
We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.
Talk to Us
Search across company data
Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.
Useful when people spend too long searching or get different answers from different systems.

Automate internal workflows
Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.
Useful when repetitive work moves across multiple tools and teams.

Add AI to products and internal tools
Build assistants, guided actions, or decision support into the software your team or customers already use.
Useful when AI needs to be part of the product, not a separate tool.
Related Terms
Thompson Sampling operates within a broader ecosystem of statistical methods and experimental frameworks. These related concepts define the problems it solves, the alternatives it contrasts with, and the foundational mathematics it employs.
Multi-Armed Bandit
The multi-armed bandit is the core sequential decision-making framework that Thompson Sampling solves. It models a gambler choosing between multiple slot machines ("arms") with unknown, fixed reward probabilities. The fundamental tension is between:
- Exploration: Gathering information about uncertain arms.
- Exploitation: Leveraging the arm currently believed to be best. Thompson Sampling provides an optimal Bayesian solution to this exploration-exploitation trade-off, dynamically allocating traffic to maximize cumulative reward over time, unlike static A/B tests.
Bayesian Inference
Thompson Sampling is fundamentally a Bayesian algorithm. It relies on maintaining a posterior probability distribution for the reward rate of each variant (arm). This posterior is updated using Bayes' theorem as new data arrives:
Posterior ∝ Likelihood × Prior.
- Prior: Initial belief about an arm's performance (e.g., a Beta distribution).
- Likelihood: The probability of observing the collected data given a reward rate.
- Posterior: The updated belief after seeing the data. By sampling from these posteriors, Thompson Sampling naturally quantifies and acts upon uncertainty.
A/B Testing
A/B testing (or split testing) is the traditional frequentist alternative to bandit algorithms like Thompson Sampling. Key contrasts include:
- Fixed Allocation: Traffic is split 50/50 (or similar) for a predetermined sample size.
- Fixed Timeline: The experiment runs to completion before a winner is declared.
- Objective: Primarily focused on statistical significance and confident inference, not minimizing regret during the test. While A/B testing provides rigorous hypothesis testing, Thompson Sampling is designed for adaptive optimization, reducing opportunity cost during the experiment by favoring better-performing variants earlier.
Upper Confidence Bound (UCB)
Upper Confidence Bound is another major class of algorithms for solving the multi-armed bandit problem, but from a frequentist perspective. While Thompson Sampling uses probability matching (sampling from posteriors), UCB uses optimism in the face of uncertainty.
- It calculates a confidence interval for each arm's reward.
- It always selects the arm with the highest upper bound of this interval.
- This deterministic approach aggressively explores arms with high potential. UCB and Thompson Sampling are the two most cited optimal solutions, with Thompson often praised for better empirical performance in complex scenarios.
Contextual Bandits
A contextual bandit is an advanced extension of the standard multi-armed bandit. In this framework, before each decision, the algorithm observes a context vector (e.g., user features, time of day). The goal is to learn a policy that maps contexts to optimal arms.
- Thompson Sampling can be extended to this setting, often using Bayesian linear regression to model the relationship between context and reward.
- The posterior distribution is over the parameters of this regression model.
- This allows for personalized adaptive decision-making at scale, a cornerstone of modern recommendation systems.
Regret
Regret is the primary theoretical metric for evaluating bandit algorithms like Thompson Sampling. It measures the cost of exploration.
- Cumulative Regret: The total difference between the rewards earned by an omniscient optimal policy (always picking the best arm) and the rewards earned by the algorithm.
- Bayesian Regret: The expected cumulative regret with respect to the prior distribution over problem instances. A key theoretical result is that Thompson Sampling achieves sublinear regret, meaning regret grows slower than time, and the average regret per round approaches zero—proving its asymptotic optimality.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
Partnered with leading AI, data, and software stack.
How We Work
Custom AI workflows for your Business
One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.
01
Review the use case
We understand the task, the users, and where AI can actually help.
Read more02
Pick the right approach
We define what needs search, automation, or product integration.
Read more03
Build the first useful version
We implement the part that proves the value first.
Read more04
Improve from there
We add the checks and visibility needed to keep it useful.
Read moreThe first call is a practical review of your use case and the right next step.
Talk to Us