Glossary

Thompson Sampling

Thompson Sampling is a Bayesian algorithm for the exploration-exploitation tradeoff, where an agent selects actions by sampling from the posterior distribution over the optimal action and then taking the sampled action.

Get in touch Learn more

Developer reviewing multi-agent chat interface on laptop, agent conversation logs visible, casual coding session at WeWork desk.

FEEDBACK LOOP ENGINEERING

What is Thompson Sampling?

Thompson sampling is a foundational Bayesian algorithm for solving the exploration-exploitation tradeoff in sequential decision-making under uncertainty.

Thompson sampling is a Bayesian algorithm for the exploration-exploitation tradeoff, where an agent selects actions by sampling from the posterior distribution over the optimal action and then taking the sampled action. This elegantly balances trying uncertain options to gather information (exploration) with choosing the currently best-known option (exploitation). It is a core method in multi-armed bandit problems and is widely applied in clinical trials, website A/B testing, and reinforcement learning.

The algorithm operates by maintaining a probabilistic belief (the posterior) about the reward of each possible action. Before each decision, it draws a single sample from each action's reward distribution and selects the action with the highest sampled value. After observing the reward, it updates its belief using Bayesian inference. This simple, randomized policy is asymptotically optimal and often outperforms deterministic alternatives like Upper Confidence Bound (UCB) in practice due to its natural uncertainty quantification.

BAYESIAN BANDIT ALGORITHM

Key Features of Thompson Sampling

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation dilemma. It operates by maintaining a probability distribution (posterior) over the expected reward of each action and selecting actions by sampling from these distributions.

Bayesian Probability Matching

At its core, Thompson Sampling is a probability matching strategy. Instead of always choosing the action with the highest estimated mean reward, it selects actions with a probability equal to the likelihood that each action is optimal, given the current posterior belief. This elegantly encodes uncertainty directly into the decision-making process.

Mechanism: For each action (arm), the algorithm maintains a posterior distribution (e.g., Beta for Bernoulli rewards, Normal for Gaussian). To choose an action, it draws one random sample from each posterior and executes the action whose sampled value is highest.
Result: Actions with higher uncertainty are explored more often, but exploitation naturally increases as evidence accumulates and posteriors become more peaked.

Natural Handling of Uncertainty

The algorithm's power stems from its explicit, probabilistic representation of uncertainty. The width of the posterior distribution for an action quantifies how uncertain the agent is about its true reward rate.

Exploration Incentive: A wide posterior (high uncertainty) means the sampled value can vary greatly. This gives under-explored actions a chance to be selected, even if their current mean estimate is low.
Automatic Decay: As an action is tried and results are observed, its posterior distribution updates (via Bayes' rule), becoming narrower and centered closer to the true mean. Exploration of that action naturally diminishes over time without needing an explicit exploration schedule.

Contextual Bandit Extension

A major strength is its seamless extension to the contextual bandit setting, where side information (context) is available before each decision. This is foundational for modern recommendation systems and personalized AI.

How it works: Instead of independent posteriors per action, the algorithm uses a probabilistic model (e.g., Bayesian linear regression) that predicts reward based on the context. The posterior is over the model parameters.
Decision Process: Given a context vector, the algorithm samples a set of parameters from the posterior, uses them to predict a reward for each action, and selects the action with the highest predicted reward from this sampled model.
Example: A news website uses user features (context) to sample from a model of article click-through rates, personalizing exploration for each visitor.

Asymptotic Optimality & No Regret

Thompson Sampling provides strong theoretical performance guarantees. It is asymptotically optimal, meaning that as the number of trials grows infinitely large, its average regret (the loss from not always choosing the best action) converges to zero at an optimal rate.

Bayesian Regret: In a Bayesian framework, its cumulative regret over time is provably bounded. The algorithm efficiently reduces uncertainty where it matters most for decision quality.
Comparison to UCB: Unlike Upper Confidence Bound (UCCB) algorithms, which use a deterministic, optimism-in-the-face-of-uncertainty principle, Thompson Sampling's randomness often leads to better empirical performance in complex, non-stationary, or delayed-feedback environments.

Integration with Deep Learning

Modern implementations often combine Thompson Sampling with deep neural networks, creating powerful Deep Bayesian Bandits. This is achieved by using the network's stochasticity or specialized layers to approximate sampling from a complex posterior.

Bootstrapped Networks: A practical method uses an ensemble of neural networks, each trained on a different bootstrap sample of the data. Acting with a randomly selected network from the ensemble approximates Thompson Sampling.
Bayesian Neural Networks (BNNs): True BNNs maintain a distribution over network weights. Sampling a weight configuration provides a model sample for decision-making.
Application: Used in large-scale systems for ad placement, web page layout optimization, and clinical trial adaptive designs, where the action space and context are high-dimensional.

Robustness to Delayed and Batched Feedback

Thompson Sampling is particularly robust in real-world production environments where reward feedback is delayed or arrives in batches. This is a common challenge in systems like recommendation engines or medical trials.

Delayed Feedback: Because it samples from the current posterior, it can make decisions immediately, even if outcomes from prior decisions are still pending. The posterior is updated retrospectively when feedback arrives.
Batch Updates: The algorithm can operate in a batch setting, making a sequence of decisions based on a single posterior sample, then updating the posterior in one go after a batch of rewards is collected. This matches common MLOps training cycles.
Contrast with TD Methods: This differs from Temporal Difference (TD) methods in reinforcement learning, which often struggle with non-Markovian, delayed feedback without careful engineering.

EXPLORATION-EXPLOITATION TRADEOFF

Thompson Sampling vs. Other Exploration Strategies

A comparison of algorithmic approaches for balancing the exploration of uncertain actions with the exploitation of known high-reward actions in reinforcement learning and multi-armed bandit problems.

Feature / Mechanism	Thompson Sampling (Bayesian Sampling)	Epsilon-Greedy	Upper Confidence Bound (UCB)
Core Principle	Probabilistic matching: sample from posterior, act on sample.	Heuristic switching: usually exploit, sometimes explore randomly.	Optimism under uncertainty: exploit action with highest statistical upper bound.
Mathematical Foundation	Bayesian inference; maintains a posterior distribution over action values.	Frequentist statistics; uses point estimates (sample means).	Frequentist statistics; uses confidence intervals derived from concentration inequalities (e.g., Hoeffding's).
Exploration Mechanism	Inherently randomized through posterior sampling. Exploration is proportional to uncertainty.	Forced, undirected randomness. Exploration is uniform over all actions.	Directed optimism. Adds an exploration bonus that decays with the square root of visits.
Handles Uncertainty
Adapts to Non-Stationary Environments		Requires decay schedules (e.g., decaying ε).	Requires specialized variants (e.g., discounted UCB, sliding-window UCB).
Computational Overhead	Moderate-High. Requires maintaining and sampling from posteriors (e.g., Beta, Gaussian).	Very Low. Simple arithmetic on counts and averages.	Low. Requires computing confidence bounds.
Theoretical Regret Bounds	Asymptotically optimal for Bernoulli bandits. Strong Bayesian regret bounds.	Linear regret if ε is constant; sublinear with decaying ε (logarithmic with optimal schedule).	Optimal minimax regret bounds (logarithmic).
Parameter Sensitivity	Low. Priors can be uninformative; algorithm is generally robust.	High. Performance heavily depends on choice and schedule of ε.	Moderate. Performance depends on scaling of confidence bound (theoretical constant often needs tuning).
Natural for Delayed/Batched Feedback
Integration with Deep RL (e.g., Deep TS)	Yes, via parameterized posterior networks (e.g., Bayesian neural networks, bootstrapped DQN).	Trivial. The default for many DQN implementations.	Possible but less common; requires uncertainty estimates from the network.

FEEDBACK LOOP ENGINEERING

Real-World Applications of Thompson Sampling

Thompson Sampling's Bayesian approach to balancing exploration and exploitation makes it a powerful algorithm for adaptive decision-making in dynamic, uncertain environments. Its applications span from optimizing digital user experiences to managing complex physical systems.

Website & Ad Content Optimization (A/B/n Testing)

Thompson Sampling is the modern, statistically efficient successor to traditional A/B testing for optimizing click-through rates (CTR) and conversion rates. Instead of splitting traffic evenly between variants for a fixed period, it dynamically allocates more traffic to better-performing options in real-time.

Key Mechanism: For each webpage element (e.g., headline, button color, image), the algorithm maintains a Beta distribution representing the belief about its conversion probability. It samples from these distributions to select a variant for each new user, observes the outcome (click/no-click), and updates the corresponding Beta posterior.
Advantages: Dramatically reduces the "opportunity cost" of showing suboptimal content during a test. It converges to the best variant faster than fixed-horizon tests, maximizing cumulative rewards (total conversions) over the testing period.
Example: A news site can use it to test 10 different article headlines, continuously learning which generates the most engagement without manual intervention.

Clinical Trial Adaptive Designs

In pharmaceutical research, Thompson Sampling enables adaptive clinical trials that can allocate more patients to more promising treatment arms while the trial is ongoing. This is a high-stakes application of the exploration-exploitation tradeoff, where exploration means learning about drug efficacy and exploitation means giving more patients access to the potentially best treatment.

Key Mechanism: Each treatment arm's efficacy and safety profile is modeled with a posterior distribution (often using a Bayesian logistic model). Patient allocation probabilities are proportional to the probability that each arm is the most effective, sampled from these posteriors.
Advantages: Increases the statistical power of trials, can lead to smaller required sample sizes, and has an ethical benefit by reducing the number of patients assigned to inferior treatments. It is particularly useful for multi-arm, multi-stage (MAMS) trials.
Consideration: Requires rigorous simulation and regulatory acceptance due to the complexity of maintaining trial blinding and controlling type-I error rates.

Recommender Systems & Personalization

Thompson Sampling provides a principled framework for the cold-start problem and ongoing personalization in recommendation engines. It balances showing popular items (exploitation) with showing new or niche items to learn user preferences (exploration).

Key Mechanism: For each user-item pair, a model maintains a distribution over the predicted rating or probability of engagement (e.g., using a Bayesian linear model or contextual bandit). When a user arrives, the system samples from the distributions for candidate items and recommends the item with the highest sampled value.
Advantages: Naturally handles non-stationary user preferences by continuously updating beliefs. It is more robust than greedy systems that can get stuck in feedback loops, always recommending the same initially-popular items.
Example: A music streaming service uses it to decide whether to recommend a user's established favorite artist or a new artist from a similar genre, learning the user's appetite for discovery over time.

Dynamic Pricing & Revenue Management

E-commerce platforms, ride-sharing services, and airlines use Thompson Sampling to set prices in real-time, maximizing revenue while learning price elasticity of demand. The algorithm explores different price points to understand their effect on conversion probability.

Key Mechanism: For each product or service context (time, location, demand level), a model defines a set of possible prices. Each price is associated with a posterior distribution over its expected profit (price * conversion probability). The algorithm samples from these profit distributions to select a price for each transaction.
Advantages: Automatically adapts to changes in market conditions, competitor pricing, and inventory levels. It avoids the pitfalls of simple rule-based systems that can be gamed or fail in volatile markets.
Example: A food delivery app dynamically prices delivery fees during peak hours, sampling from a model that balances the immediate fee revenue against the probability of losing the customer to a competitor.

Resource Allocation in Networks & Systems

Thompson Sampling is applied to optimize resource allocation in complex systems where the payoff of an allocation choice is noisy and the optimal configuration is unknown. This includes load balancing, network routing, and cloud computing resource management.

Key Mechanism: Each possible allocation (e.g., which server to route a request to, which channel to use for transmission) is treated as an arm. The reward is a performance metric like latency, throughput, or job completion time. The algorithm maintains posteriors over the expected reward for each arm and samples to make allocation decisions.
Advantages: Handles non-stationary environments well, such as servers becoming overloaded or network paths congesting. It provides a self-optimizing control loop without requiring a precise system model.
Example: A content delivery network (CDN) uses it to choose between multiple edge servers for a client request, learning which server currently provides the lowest latency for that client's region.

Robotics & Autonomous System Parameter Tuning

In robotics and industrial control, Thompson Sampling helps autonomously tune parameters (e.g., controller gains, gait parameters for walking robots) in physical systems where the relationship between parameters and performance is complex and evaluations are expensive or noisy.

Key Mechanism: The parameter space is discretized or modeled with a Gaussian Process. Each parameter set is an arm, and the reward is a measure of task performance (e.g., stability, speed, energy efficiency). The algorithm sequentially selects parameters to test on the physical system, updates the Bayesian model, and increasingly exploits the best-found settings.
Advantages: Safer and more sample-efficient than random search or grid search, as it builds a probabilistic model of performance. It is a form of Bayesian optimization that naturally handles the exploration-exploitation tradeoff inherent in tuning.
Example: An autonomous drone uses it to optimize the PID controller coefficients for a new payload weight, minimizing the number of potentially unstable test flights required.

THOMPSON SAMPLING

Frequently Asked Questions

Thompson sampling is a foundational Bayesian algorithm for solving the exploration-exploitation dilemma in sequential decision-making. These questions address its core mechanics, applications, and relationship to other feedback-driven systems.

Thompson sampling is a Bayesian algorithm for the exploration-exploitation tradeoff, where an agent selects actions by sampling from the posterior probability distribution over which action is optimal and then taking the sampled action. It works by maintaining a belief state—a probability distribution—over the unknown parameters of the reward model for each action (e.g., the probability of success for a Bernoulli bandit). For each decision, the agent draws a single sample from the posterior distribution for each action, imagines that this sampled parameter is the true one, and selects the action with the highest sampled value. After observing the reward, it updates its posterior belief using Bayes' theorem, making the algorithm inherently adaptive. This simple sample-then-act loop elegantly balances exploring uncertain actions and exploiting known high-reward actions based on the agent's current uncertainty.

Key Steps in a Bernoulli Bandit:

Initialize Beliefs: Start with a prior distribution (e.g., Beta(1,1)) for each arm's success probability.
Sample: For each arm, draw a sample θᵢ from its current Beta posterior.
Act: Pull the arm with the highest sampled θᵢ.
Observe: Receive a binary reward (success/failure).
Update: Update that arm's Beta posterior: Beta(α + success, β + failure).
Repeat.

Enabling Efficiency, Speed & Accuracy

Intelligent Analysis, Decision & Execution

We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.

Talk to Us

Search across company data

Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.

Useful when people spend too long searching or get different answers from different systems.

Enterprise searchRAGPermissions

Automate internal workflows

Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.

Useful when repetitive work moves across multiple tools and teams.

AI agentsWorkflow automationGovernance

Add AI to products and internal tools

Build assistants, guided actions, or decision support into the software your team or customers already use.

Useful when AI needs to be part of the product, not a separate tool.

AI integrationDecision supportModel routing

FEEDBACK LOOP ENGINEERING

Related Terms

Thompson Sampling is a core algorithm for managing the exploration-exploitation tradeoff in adaptive systems. These related concepts define the broader landscape of reinforcement learning and decision-making frameworks.

Upper Confidence Bound (UCB)

The Upper Confidence Bound (UCB) algorithm is a deterministic method for the exploration-exploitation tradeoff. It selects actions by calculating an optimism-in-the-face-of-uncertainty index: the current estimated reward for an action plus an exploration bonus. This bonus is proportional to the logarithm of the total number of plays divided by the number of times that specific action was taken. Unlike Thompson Sampling's probabilistic sampling, UCB provides a deterministic, mathematically derived rule that guarantees sub-linear regret bounds in stochastic environments.

EXPLORE

Epsilon-Greedy

Epsilon-Greedy is a simple, heuristic exploration strategy. With probability 1-ε, the agent exploits by choosing the action with the highest known empirical reward. With probability ε, it explores by choosing a random action uniformly. While easy to implement, it is less efficient than Bayesian methods like Thompson Sampling because its exploration is undirected—it does not account for the uncertainty or potential value of non-optimal actions. It often serves as a baseline algorithm in multi-armed bandit problems.

Contextual Bandits

A Contextual Bandit extends the classic multi-armed bandit problem by introducing context or side information available before each decision. The goal is to learn a policy that maps contexts to actions to maximize cumulative reward. Thompson Sampling is readily adapted to this setting by using a Bayesian linear model (or other function approximator) where parameters have a posterior distribution. For each context, an action is chosen by sampling model parameters from the posterior and selecting the action with the highest predicted reward under the sampled parameters.

Bayesian Optimization

Bayesian Optimization is a global optimization strategy for expensive black-box functions, closely related to bandit problems. It uses a probabilistic surrogate model (like a Gaussian Process) to model the objective function. An acquisition function, which balances exploration and exploitation, guides the selection of the next point to evaluate. Thompson Sampling is one such acquisition strategy, where the next query point is chosen by sampling a function from the posterior of the surrogate model and maximizing the sampled function. This makes it effective for hyperparameter tuning.

Posterior Sampling for Reinforcement Learning (PSRL)

Posterior Sampling for Reinforcement Learning is an algorithm that applies the Thompson Sampling principle to full Markov Decision Processes (MDPs). Instead of maintaining a posterior over arms, the agent maintains a posterior distribution over the MDP's dynamics and reward models. At the start of each episode, a single MDP is sampled from this posterior. The agent then acts optimally in the sampled MDP for the duration of the episode. PSRL achieves near-optimal Bayesian regret bounds and elegantly handles the exploration-exploitation tradeoff in complex, stateful environments.

Gittins Index

The Gittins Index provides the optimal solution to the Bayesian multi-armed bandit problem with geometric discounting for a risk-neutral agent. It assigns a dynamic priority index to each arm, and the optimal policy is to always play the arm with the highest current index. While theoretically optimal, the index is often computationally intractable to calculate exactly for complex priors. Thompson Sampling is a computationally efficient approximation that, while not always index-optimal, is simple to implement and performs remarkably well in practice, often matching or exceeding the performance of Gittins-index policies.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.

Limited slotsGet a Free AI Consultation

How We Work

Custom AI workflows for your Business

One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.

Talk to Us

Thompson Sampling

What is Thompson Sampling?

Key Features of Thompson Sampling

Bayesian Probability Matching

Natural Handling of Uncertainty

Contextual Bandit Extension

Asymptotic Optimality & No Regret

Integration with Deep Learning

Robustness to Delayed and Batched Feedback

Thompson Sampling vs. Other Exploration Strategies

Real-World Applications of Thompson Sampling

Website & Ad Content Optimization (A/B/n Testing)

Clinical Trial Adaptive Designs

Recommender Systems & Personalization

Dynamic Pricing & Revenue Management

Resource Allocation in Networks & Systems

Robotics & Autonomous System Parameter Tuning

Frequently Asked Questions

Intelligent Analysis, Decision & Execution

Search across company data

Automate internal workflows

Add AI to products and internal tools

Upper Confidence Bound (UCB)

Prasad Kumkar

Partnered with leading AI, data, and software stack.

Custom AI workflows for your Business

Review the use case

Pick the right approach

Build the first useful version

Improve from there