Inferensys

Glossary

Thompson Sampling

Thompson Sampling is a Bayesian algorithm for the exploration-exploitation tradeoff, where an agent selects actions by sampling from the posterior distribution over the optimal action and then taking the sampled action.
Developer reviewing multi-agent chat interface on laptop, agent conversation logs visible, casual coding session at WeWork desk.
FEEDBACK LOOP ENGINEERING

What is Thompson Sampling?

Thompson sampling is a foundational Bayesian algorithm for solving the exploration-exploitation tradeoff in sequential decision-making under uncertainty.

Thompson sampling is a Bayesian algorithm for the exploration-exploitation tradeoff, where an agent selects actions by sampling from the posterior distribution over the optimal action and then taking the sampled action. This elegantly balances trying uncertain options to gather information (exploration) with choosing the currently best-known option (exploitation). It is a core method in multi-armed bandit problems and is widely applied in clinical trials, website A/B testing, and reinforcement learning.

The algorithm operates by maintaining a probabilistic belief (the posterior) about the reward of each possible action. Before each decision, it draws a single sample from each action's reward distribution and selects the action with the highest sampled value. After observing the reward, it updates its belief using Bayesian inference. This simple, randomized policy is asymptotically optimal and often outperforms deterministic alternatives like Upper Confidence Bound (UCB) in practice due to its natural uncertainty quantification.

BAYESIAN BANDIT ALGORITHM

Key Features of Thompson Sampling

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation dilemma. It operates by maintaining a probability distribution (posterior) over the expected reward of each action and selecting actions by sampling from these distributions.

01

Bayesian Probability Matching

At its core, Thompson Sampling is a probability matching strategy. Instead of always choosing the action with the highest estimated mean reward, it selects actions with a probability equal to the likelihood that each action is optimal, given the current posterior belief. This elegantly encodes uncertainty directly into the decision-making process.

  • Mechanism: For each action (arm), the algorithm maintains a posterior distribution (e.g., Beta for Bernoulli rewards, Normal for Gaussian). To choose an action, it draws one random sample from each posterior and executes the action whose sampled value is highest.
  • Result: Actions with higher uncertainty are explored more often, but exploitation naturally increases as evidence accumulates and posteriors become more peaked.
02

Natural Handling of Uncertainty

The algorithm's power stems from its explicit, probabilistic representation of uncertainty. The width of the posterior distribution for an action quantifies how uncertain the agent is about its true reward rate.

  • Exploration Incentive: A wide posterior (high uncertainty) means the sampled value can vary greatly. This gives under-explored actions a chance to be selected, even if their current mean estimate is low.
  • Automatic Decay: As an action is tried and results are observed, its posterior distribution updates (via Bayes' rule), becoming narrower and centered closer to the true mean. Exploration of that action naturally diminishes over time without needing an explicit exploration schedule.
03

Contextual Bandit Extension

A major strength is its seamless extension to the contextual bandit setting, where side information (context) is available before each decision. This is foundational for modern recommendation systems and personalized AI.

  • How it works: Instead of independent posteriors per action, the algorithm uses a probabilistic model (e.g., Bayesian linear regression) that predicts reward based on the context. The posterior is over the model parameters.
  • Decision Process: Given a context vector, the algorithm samples a set of parameters from the posterior, uses them to predict a reward for each action, and selects the action with the highest predicted reward from this sampled model.
  • Example: A news website uses user features (context) to sample from a model of article click-through rates, personalizing exploration for each visitor.
04

Asymptotic Optimality & No Regret

Thompson Sampling provides strong theoretical performance guarantees. It is asymptotically optimal, meaning that as the number of trials grows infinitely large, its average regret (the loss from not always choosing the best action) converges to zero at an optimal rate.

  • Bayesian Regret: In a Bayesian framework, its cumulative regret over time is provably bounded. The algorithm efficiently reduces uncertainty where it matters most for decision quality.
  • Comparison to UCB: Unlike Upper Confidence Bound (UCCB) algorithms, which use a deterministic, optimism-in-the-face-of-uncertainty principle, Thompson Sampling's randomness often leads to better empirical performance in complex, non-stationary, or delayed-feedback environments.
05

Integration with Deep Learning

Modern implementations often combine Thompson Sampling with deep neural networks, creating powerful Deep Bayesian Bandits. This is achieved by using the network's stochasticity or specialized layers to approximate sampling from a complex posterior.

  • Bootstrapped Networks: A practical method uses an ensemble of neural networks, each trained on a different bootstrap sample of the data. Acting with a randomly selected network from the ensemble approximates Thompson Sampling.
  • Bayesian Neural Networks (BNNs): True BNNs maintain a distribution over network weights. Sampling a weight configuration provides a model sample for decision-making.
  • Application: Used in large-scale systems for ad placement, web page layout optimization, and clinical trial adaptive designs, where the action space and context are high-dimensional.
06

Robustness to Delayed and Batched Feedback

Thompson Sampling is particularly robust in real-world production environments where reward feedback is delayed or arrives in batches. This is a common challenge in systems like recommendation engines or medical trials.

  • Delayed Feedback: Because it samples from the current posterior, it can make decisions immediately, even if outcomes from prior decisions are still pending. The posterior is updated retrospectively when feedback arrives.
  • Batch Updates: The algorithm can operate in a batch setting, making a sequence of decisions based on a single posterior sample, then updating the posterior in one go after a batch of rewards is collected. This matches common MLOps training cycles.
  • Contrast with TD Methods: This differs from Temporal Difference (TD) methods in reinforcement learning, which often struggle with non-Markovian, delayed feedback without careful engineering.
EXPLORATION-EXPLOITATION TRADEOFF

Thompson Sampling vs. Other Exploration Strategies

A comparison of algorithmic approaches for balancing the exploration of uncertain actions with the exploitation of known high-reward actions in reinforcement learning and multi-armed bandit problems.

Feature / MechanismThompson Sampling (Bayesian Sampling)Epsilon-GreedyUpper Confidence Bound (UCB)

Core Principle

Probabilistic matching: sample from posterior, act on sample.

Heuristic switching: usually exploit, sometimes explore randomly.

Optimism under uncertainty: exploit action with highest statistical upper bound.

Mathematical Foundation

Bayesian inference; maintains a posterior distribution over action values.

Frequentist statistics; uses point estimates (sample means).

Frequentist statistics; uses confidence intervals derived from concentration inequalities (e.g., Hoeffding's).

Exploration Mechanism

Inherently randomized through posterior sampling. Exploration is proportional to uncertainty.

Forced, undirected randomness. Exploration is uniform over all actions.

Directed optimism. Adds an exploration bonus that decays with the square root of visits.

Handles Uncertainty

Adapts to Non-Stationary Environments

Requires decay schedules (e.g., decaying ε).

Requires specialized variants (e.g., discounted UCB, sliding-window UCB).

Computational Overhead

Moderate-High. Requires maintaining and sampling from posteriors (e.g., Beta, Gaussian).

Very Low. Simple arithmetic on counts and averages.

Low. Requires computing confidence bounds.

Theoretical Regret Bounds

Asymptotically optimal for Bernoulli bandits. Strong Bayesian regret bounds.

Linear regret if ε is constant; sublinear with decaying ε (logarithmic with optimal schedule).

Optimal minimax regret bounds (logarithmic).

Parameter Sensitivity

Low. Priors can be uninformative; algorithm is generally robust.

High. Performance heavily depends on choice and schedule of ε.

Moderate. Performance depends on scaling of confidence bound (theoretical constant often needs tuning).

Natural for Delayed/Batched Feedback

Integration with Deep RL (e.g., Deep TS)

Yes, via parameterized posterior networks (e.g., Bayesian neural networks, bootstrapped DQN).

Trivial. The default for many DQN implementations.

Possible but less common; requires uncertainty estimates from the network.

FEEDBACK LOOP ENGINEERING

Real-World Applications of Thompson Sampling

Thompson Sampling's Bayesian approach to balancing exploration and exploitation makes it a powerful algorithm for adaptive decision-making in dynamic, uncertain environments. Its applications span from optimizing digital user experiences to managing complex physical systems.

01

Website & Ad Content Optimization (A/B/n Testing)

Thompson Sampling is the modern, statistically efficient successor to traditional A/B testing for optimizing click-through rates (CTR) and conversion rates. Instead of splitting traffic evenly between variants for a fixed period, it dynamically allocates more traffic to better-performing options in real-time.

  • Key Mechanism: For each webpage element (e.g., headline, button color, image), the algorithm maintains a Beta distribution representing the belief about its conversion probability. It samples from these distributions to select a variant for each new user, observes the outcome (click/no-click), and updates the corresponding Beta posterior.
  • Advantages: Dramatically reduces the "opportunity cost" of showing suboptimal content during a test. It converges to the best variant faster than fixed-horizon tests, maximizing cumulative rewards (total conversions) over the testing period.
  • Example: A news site can use it to test 10 different article headlines, continuously learning which generates the most engagement without manual intervention.
02

Clinical Trial Adaptive Designs

In pharmaceutical research, Thompson Sampling enables adaptive clinical trials that can allocate more patients to more promising treatment arms while the trial is ongoing. This is a high-stakes application of the exploration-exploitation tradeoff, where exploration means learning about drug efficacy and exploitation means giving more patients access to the potentially best treatment.

  • Key Mechanism: Each treatment arm's efficacy and safety profile is modeled with a posterior distribution (often using a Bayesian logistic model). Patient allocation probabilities are proportional to the probability that each arm is the most effective, sampled from these posteriors.
  • Advantages: Increases the statistical power of trials, can lead to smaller required sample sizes, and has an ethical benefit by reducing the number of patients assigned to inferior treatments. It is particularly useful for multi-arm, multi-stage (MAMS) trials.
  • Consideration: Requires rigorous simulation and regulatory acceptance due to the complexity of maintaining trial blinding and controlling type-I error rates.
03

Recommender Systems & Personalization

Thompson Sampling provides a principled framework for the cold-start problem and ongoing personalization in recommendation engines. It balances showing popular items (exploitation) with showing new or niche items to learn user preferences (exploration).

  • Key Mechanism: For each user-item pair, a model maintains a distribution over the predicted rating or probability of engagement (e.g., using a Bayesian linear model or contextual bandit). When a user arrives, the system samples from the distributions for candidate items and recommends the item with the highest sampled value.
  • Advantages: Naturally handles non-stationary user preferences by continuously updating beliefs. It is more robust than greedy systems that can get stuck in feedback loops, always recommending the same initially-popular items.
  • Example: A music streaming service uses it to decide whether to recommend a user's established favorite artist or a new artist from a similar genre, learning the user's appetite for discovery over time.
04

Dynamic Pricing & Revenue Management

E-commerce platforms, ride-sharing services, and airlines use Thompson Sampling to set prices in real-time, maximizing revenue while learning price elasticity of demand. The algorithm explores different price points to understand their effect on conversion probability.

  • Key Mechanism: For each product or service context (time, location, demand level), a model defines a set of possible prices. Each price is associated with a posterior distribution over its expected profit (price * conversion probability). The algorithm samples from these profit distributions to select a price for each transaction.
  • Advantages: Automatically adapts to changes in market conditions, competitor pricing, and inventory levels. It avoids the pitfalls of simple rule-based systems that can be gamed or fail in volatile markets.
  • Example: A food delivery app dynamically prices delivery fees during peak hours, sampling from a model that balances the immediate fee revenue against the probability of losing the customer to a competitor.
05

Resource Allocation in Networks & Systems

Thompson Sampling is applied to optimize resource allocation in complex systems where the payoff of an allocation choice is noisy and the optimal configuration is unknown. This includes load balancing, network routing, and cloud computing resource management.

  • Key Mechanism: Each possible allocation (e.g., which server to route a request to, which channel to use for transmission) is treated as an arm. The reward is a performance metric like latency, throughput, or job completion time. The algorithm maintains posteriors over the expected reward for each arm and samples to make allocation decisions.
  • Advantages: Handles non-stationary environments well, such as servers becoming overloaded or network paths congesting. It provides a self-optimizing control loop without requiring a precise system model.
  • Example: A content delivery network (CDN) uses it to choose between multiple edge servers for a client request, learning which server currently provides the lowest latency for that client's region.
06

Robotics & Autonomous System Parameter Tuning

In robotics and industrial control, Thompson Sampling helps autonomously tune parameters (e.g., controller gains, gait parameters for walking robots) in physical systems where the relationship between parameters and performance is complex and evaluations are expensive or noisy.

  • Key Mechanism: The parameter space is discretized or modeled with a Gaussian Process. Each parameter set is an arm, and the reward is a measure of task performance (e.g., stability, speed, energy efficiency). The algorithm sequentially selects parameters to test on the physical system, updates the Bayesian model, and increasingly exploits the best-found settings.
  • Advantages: Safer and more sample-efficient than random search or grid search, as it builds a probabilistic model of performance. It is a form of Bayesian optimization that naturally handles the exploration-exploitation tradeoff inherent in tuning.
  • Example: An autonomous drone uses it to optimize the PID controller coefficients for a new payload weight, minimizing the number of potentially unstable test flights required.
THOMPSON SAMPLING

Frequently Asked Questions

Thompson sampling is a foundational Bayesian algorithm for solving the exploration-exploitation dilemma in sequential decision-making. These questions address its core mechanics, applications, and relationship to other feedback-driven systems.

Thompson sampling is a Bayesian algorithm for the exploration-exploitation tradeoff, where an agent selects actions by sampling from the posterior probability distribution over which action is optimal and then taking the sampled action. It works by maintaining a belief state—a probability distribution—over the unknown parameters of the reward model for each action (e.g., the probability of success for a Bernoulli bandit). For each decision, the agent draws a single sample from the posterior distribution for each action, imagines that this sampled parameter is the true one, and selects the action with the highest sampled value. After observing the reward, it updates its posterior belief using Bayes' theorem, making the algorithm inherently adaptive. This simple sample-then-act loop elegantly balances exploring uncertain actions and exploiting known high-reward actions based on the agent's current uncertainty.

Key Steps in a Bernoulli Bandit:

  1. Initialize Beliefs: Start with a prior distribution (e.g., Beta(1,1)) for each arm's success probability.
  2. Sample: For each arm, draw a sample θᵢ from its current Beta posterior.
  3. Act: Pull the arm with the highest sampled θᵢ.
  4. Observe: Receive a binary reward (success/failure).
  5. Update: Update that arm's Beta posterior: Beta(α + success, β + failure).
  6. Repeat.
Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.