Inferensys

Glossary

Thompson Sampling

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation trade-off in sequential decision problems by sampling from posterior distributions to select actions.
Cinematic overhead of a WeWork creative suite room with multiple curved monitors showing AI decision dashboards, executives in casual attire reviewing data, dramatic pendant lighting.
BAYESIAN BANDIT ALGORITHM

What is Thompson Sampling?

Thompson Sampling is a foundational Bayesian algorithm for solving the exploration-exploitation dilemma in sequential decision-making.

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation trade-off in sequential decision problems, such as the multi-armed bandit, where actions are selected by sampling from the posterior probability distribution over which action is optimal and then updating beliefs based on observed rewards. This elegant approach naturally balances trying uncertain actions (exploration) with favoring actions believed to be best (exploitation) by treating the problem as one of Bayesian inference. It is mathematically equivalent to probability matching.

The algorithm operates by maintaining a prior distribution over the expected reward of each action (e.g., a Beta distribution for Bernoulli rewards). On each trial, it samples a potential reward value from each action's current posterior. The action with the highest sampled value is selected and executed. Upon observing the actual reward, the corresponding action's posterior distribution is updated using Bayes' theorem. This process makes it highly effective for online learning and is a core component in Bayesian optimization and certain model-based reinforcement learning systems.

ALGORITHMIC MECHANISMS

Key Features of Thompson Sampling

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation trade-off. Its core features derive from its probabilistic approach to action selection and belief updating.

01

Bayesian Belief Representation

Thompson Sampling maintains a posterior probability distribution over the expected reward for each possible action. This distribution, often a Beta distribution for Bernoulli rewards or a Gaussian for continuous rewards, encodes the algorithm's uncertainty. A wide distribution indicates high uncertainty (encouraging exploration), while a narrow, peaked distribution indicates high confidence (encouraging exploitation).

02

Probability Matching Action Selection

The algorithm does not simply choose the action with the highest estimated mean reward. Instead, it performs probability matching: it selects an action with a probability equal to the posterior probability that said action is optimal. This is implemented by sampling a single reward estimate from the posterior distribution of each arm and then choosing the arm with the highest sampled value. This elegant mechanism naturally balances exploration and exploitation.

03

Sequential, Online Posterior Updates

Beliefs are updated sequentially and online after each action and observed reward. For a Bernoulli bandit (success/failure rewards), if a Beta(α, β) prior is used:

  • A success increments α.
  • A failure increments β. This Bayesian update ensures the posterior distribution incorporates all historical data, continuously refining the estimate of each action's reward probability. The algorithm requires no separate exploration phase.
04

Optimal Regret Performance

Thompson Sampling achieves asymptotically optimal regret bounds for standard multi-armed bandit problems. Regret is the difference between the cumulative reward of the optimal strategy and the algorithm's reward. For Bernoulli bandits, its regret grows logarithmically with the number of rounds, matching the theoretical lower bound. This makes it highly sample-efficient in practice.

05

Natural Handling of Contextual Bandits

The framework extends naturally to contextual bandits, where side information (context) is available before each decision. A probabilistic model (e.g., a Bayesian linear regression) predicts rewards given the context. Thompson Sampling operates by sampling a parameter vector from the posterior of this model and selecting the action that maximizes the reward under the sampled parameters. This is foundational for modern personalized recommendation systems.

06

Connection to Reinforcement Learning

In model-based reinforcement learning, Thompson Sampling can be applied at the level of the environment model. An agent maintains a posterior distribution over possible world models (e.g., over transition dynamics). It samples a model, plans an optimal policy under that sampled model, acts, and then updates the model posterior with the observed outcome. This is known as Posterior Sampling for Reinforcement Learning (PSRL).

THOMPSON SAMPLING

Frequently Asked Questions

A deep dive into the Bayesian algorithm for balancing exploration and exploitation in sequential decision-making, foundational to reinforcement learning and multi-armed bandit problems.

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation trade-off in sequential decision problems, where actions are selected by sampling from the posterior distribution over the optimal action and updating beliefs based on observed rewards. The core mechanism operates in a loop: for each decision round, the algorithm samples a single plausible reward model from the current posterior distribution for each possible action (e.g., each 'arm' of a multi-armed bandit). It then executes the action whose sampled model promises the highest immediate reward. After observing the actual reward, it updates the posterior distribution for that specific action using Bayes' theorem, refining its belief about that action's true reward distribution. This simple 'sample-then-update' process naturally balances exploration (trying uncertain actions) and exploitation (choosing actions believed to be best), as actions with high uncertainty have a wider posterior and thus a non-zero probability of being sampled as optimal.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.