Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation trade-off in sequential decision problems, such as the multi-armed bandit, where actions are selected by sampling from the posterior probability distribution over which action is optimal and then updating beliefs based on observed rewards. This elegant approach naturally balances trying uncertain actions (exploration) with favoring actions believed to be best (exploitation) by treating the problem as one of Bayesian inference. It is mathematically equivalent to probability matching.
Glossary
Thompson Sampling

What is Thompson Sampling?
Thompson Sampling is a foundational Bayesian algorithm for solving the exploration-exploitation dilemma in sequential decision-making.
The algorithm operates by maintaining a prior distribution over the expected reward of each action (e.g., a Beta distribution for Bernoulli rewards). On each trial, it samples a potential reward value from each action's current posterior. The action with the highest sampled value is selected and executed. Upon observing the actual reward, the corresponding action's posterior distribution is updated using Bayes' theorem. This process makes it highly effective for online learning and is a core component in Bayesian optimization and certain model-based reinforcement learning systems.
Key Features of Thompson Sampling
Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation trade-off. Its core features derive from its probabilistic approach to action selection and belief updating.
Bayesian Belief Representation
Thompson Sampling maintains a posterior probability distribution over the expected reward for each possible action. This distribution, often a Beta distribution for Bernoulli rewards or a Gaussian for continuous rewards, encodes the algorithm's uncertainty. A wide distribution indicates high uncertainty (encouraging exploration), while a narrow, peaked distribution indicates high confidence (encouraging exploitation).
Probability Matching Action Selection
The algorithm does not simply choose the action with the highest estimated mean reward. Instead, it performs probability matching: it selects an action with a probability equal to the posterior probability that said action is optimal. This is implemented by sampling a single reward estimate from the posterior distribution of each arm and then choosing the arm with the highest sampled value. This elegant mechanism naturally balances exploration and exploitation.
Sequential, Online Posterior Updates
Beliefs are updated sequentially and online after each action and observed reward. For a Bernoulli bandit (success/failure rewards), if a Beta(α, β) prior is used:
- A success increments α.
- A failure increments β. This Bayesian update ensures the posterior distribution incorporates all historical data, continuously refining the estimate of each action's reward probability. The algorithm requires no separate exploration phase.
Optimal Regret Performance
Thompson Sampling achieves asymptotically optimal regret bounds for standard multi-armed bandit problems. Regret is the difference between the cumulative reward of the optimal strategy and the algorithm's reward. For Bernoulli bandits, its regret grows logarithmically with the number of rounds, matching the theoretical lower bound. This makes it highly sample-efficient in practice.
Natural Handling of Contextual Bandits
The framework extends naturally to contextual bandits, where side information (context) is available before each decision. A probabilistic model (e.g., a Bayesian linear regression) predicts rewards given the context. Thompson Sampling operates by sampling a parameter vector from the posterior of this model and selecting the action that maximizes the reward under the sampled parameters. This is foundational for modern personalized recommendation systems.
Connection to Reinforcement Learning
In model-based reinforcement learning, Thompson Sampling can be applied at the level of the environment model. An agent maintains a posterior distribution over possible world models (e.g., over transition dynamics). It samples a model, plans an optimal policy under that sampled model, acts, and then updates the model posterior with the observed outcome. This is known as Posterior Sampling for Reinforcement Learning (PSRL).
Frequently Asked Questions
A deep dive into the Bayesian algorithm for balancing exploration and exploitation in sequential decision-making, foundational to reinforcement learning and multi-armed bandit problems.
Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation trade-off in sequential decision problems, where actions are selected by sampling from the posterior distribution over the optimal action and updating beliefs based on observed rewards. The core mechanism operates in a loop: for each decision round, the algorithm samples a single plausible reward model from the current posterior distribution for each possible action (e.g., each 'arm' of a multi-armed bandit). It then executes the action whose sampled model promises the highest immediate reward. After observing the actual reward, it updates the posterior distribution for that specific action using Bayes' theorem, refining its belief about that action's true reward distribution. This simple 'sample-then-update' process naturally balances exploration (trying uncertain actions) and exploitation (choosing actions believed to be best), as actions with high uncertainty have a wider posterior and thus a non-zero probability of being sampled as optimal.
Enabling Efficiency, Speed & Accuracy
Intelligent Analysis, Decision & Execution
We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.
Talk to Us
Search across company data
Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.
Useful when people spend too long searching or get different answers from different systems.

Automate internal workflows
Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.
Useful when repetitive work moves across multiple tools and teams.

Add AI to products and internal tools
Build assistants, guided actions, or decision support into the software your team or customers already use.
Useful when AI needs to be part of the product, not a separate tool.
Related Terms
Thompson Sampling is a core algorithm for balancing exploration and exploitation in sequential decision-making. The following terms are essential for understanding its context, alternatives, and mathematical foundations.
Multi-Armed Bandit Problem
The Multi-Armed Bandit Problem is the foundational sequential decision-making framework where an agent must repeatedly choose between multiple actions (arms), each providing a stochastic reward drawn from an unknown distribution. The core challenge is the exploration-exploitation trade-off: balancing the need to gather information about uncertain arms (exploration) with the desire to capitalize on the arm currently believed to be best (exploitation). Thompson Sampling is a Bayesian solution to this problem.
Upper Confidence Bound (UCB)
Upper Confidence Bound (UCB) is a deterministic, frequentist alternative to Thompson Sampling for solving the multi-armed bandit problem. Instead of sampling from a posterior, UCB algorithms select the arm with the highest upper confidence bound on its expected reward. This bound is calculated as the current empirical mean plus an exploration bonus that shrinks as the arm is pulled more often. Key variants include UCB1 and KL-UCB. Unlike Thompson Sampling, UCB provides explicit, worst-case regret bounds.
Bayesian Inference
Bayesian Inference is the statistical paradigm that underpins Thompson Sampling. It involves updating the probability for a hypothesis (e.g., 'arm A is optimal') as more evidence (observed rewards) becomes available. The process follows Bayes' Theorem: Posterior ∝ Likelihood × Prior.
- Prior: Initial belief about reward distributions (e.g., Beta(1,1) for Bernoulli rewards).
- Likelihood: Probability of observed data given the parameters.
- Posterior: Updated belief after observing data. Thompson Sampling acts by sampling parameters from this posterior.
Conjugate Prior
A Conjugate Prior is a probability distribution that, when used as a prior for a specific likelihood function, results in a posterior distribution that is in the same family. This mathematical convenience is critical for the efficient, closed-form updates in Thompson Sampling.
- For Bernoulli/Binomial rewards (success/failure), the conjugate prior is the Beta distribution.
- For Gaussian rewards with known variance, the conjugate prior for the mean is another Gaussian distribution. Using a conjugate prior allows the agent to update its belief state (the posterior parameters) analytically after each observation, enabling real-time decision-making.
Regret
Regret is the primary performance metric for evaluating bandit algorithms like Thompson Sampling. It quantifies the cumulative opportunity cost of not always selecting the optimal action.
- Cumulative Regret: The sum over time of the difference between the reward of the optimal arm and the reward of the arm actually chosen.
- Bayesian Regret: The expected cumulative regret, where the expectation is taken over both the randomness in the algorithm and the prior distribution over problem instances. A key goal is to design algorithms with sublinear regret, meaning the average regret per round goes to zero over time, proving the agent is learning optimally.
Contextual Bandits
Contextual Bandits extend the classic multi-armed bandit framework by incorporating side information or context. Before each round, the agent observes a context vector (e.g., user features, time of day) that can inform which arm is likely to be best. The goal is to learn a policy that maps contexts to actions. Thompson Sampling can be extended to this setting, often by using a Bayesian linear model (e.g., with a Gaussian prior) to parameterize the expected reward for each arm given the context, enabling personalized decision-making.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
Partnered with leading AI, data, and software stack.
How We Work
Custom AI workflows for your Business
One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.
01
Review the use case
We understand the task, the users, and where AI can actually help.
Read more02
Pick the right approach
We define what needs search, automation, or product integration.
Read more03
Build the first useful version
We implement the part that proves the value first.
Read more04
Improve from there
We add the checks and visibility needed to keep it useful.
Read moreThe first call is a practical review of your use case and the right next step.
Talk to Us