Thompson Sampling is a Bayesian algorithm for managing the exploration-exploitation trade-off in sequential decision problems such as the multi-armed bandit. On each round, it samples from the posterior probability distribution over which action is optimal, plays the sampled action, and then updates its beliefs based on the observed reward. By treating the decision problem as one of Bayesian inference, this approach naturally balances trying uncertain actions (exploration) against favoring actions currently believed to be best (exploitation). It is mathematically equivalent to probability matching: each action is selected with exactly the probability that it is optimal under the current posterior.
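The loop described above can be sketched for the Bernoulli bandit, where a Beta prior on each arm's success rate makes the posterior update a simple count increment. The function name, arm probabilities, and round count below are illustrative choices, not part of any standard API:

```python
import random

def thompson_sampling(true_probs, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson Sampling on a toy multi-armed bandit."""
    rng = random.Random(seed)
    k = len(true_probs)
    # Beta(1, 1) uniform prior per arm: alpha counts observed successes
    # plus one, beta counts observed failures plus one.
    alpha = [1] * k
    beta = [1] * k
    total_reward = 0
    for _ in range(n_rounds):
        # Exploration/exploitation in one step: draw a plausible success
        # rate for each arm from its posterior, then play the arm whose
        # sampled rate is highest.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        # Observe a Bernoulli reward and update that arm's posterior.
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return alpha, beta, total_reward
```

As an arm's posterior concentrates, its sampled rates vary less, so clearly inferior arms are drawn less and less often while genuinely uncertain arms still get occasional plays.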
