Inferensys

Glossary

Thompson Sampling

Thompson sampling is a Bayesian algorithm for solving the multi-armed bandit problem that selects actions by sampling from posterior probability distributions of each variant's reward.
Stylish WeWork-like workspace with hot desks and document wall, professional searching through enterprise knowledge base on a mounted ultrawide display, warm industrial pendants overhead.
A/B TESTING FRAMEWORKS

What is Thompson Sampling?

Thompson sampling is a Bayesian algorithm for solving the multi-armed bandit problem that selects an action by sampling from the posterior probability distribution of each variant's reward and choosing the one with the highest sampled value.

Thompson sampling is a Bayesian algorithm for solving the multi-armed bandit problem, a sequential decision-making framework that balances exploration of uncertain options with exploitation of the best-known option. Unlike traditional A/B testing, which uses fixed traffic splits, it dynamically allocates traffic by sampling from a posterior probability distribution for each variant's reward and selecting the variant with the highest sampled value, naturally favoring more promising options over time.

The algorithm begins with a prior belief (e.g., a Beta distribution) about each variant's performance. After each user interaction, it observes the outcome (e.g., a click or conversion) and updates the corresponding variant's posterior distribution using Bayesian inference. This continuous update allows it to adapt allocation in real-time, making it highly efficient for optimizing metrics like click-through rate. It inherently addresses the explore-exploit trade-off and mitigates the peeking problem associated with frequentist sequential testing.

BAYESIAN BANDIT ALGORITHM

Key Features of Thompson Sampling

Thompson Sampling is a probability-matching algorithm for the multi-armed bandit problem that balances exploration and exploitation by sampling from posterior distributions. Its core features stem from its Bayesian foundation.

01

Bayesian Probability Matching

At its core, Thompson Sampling is a probability matching algorithm. It selects an action with a probability equal to that action being the optimal one, given the current uncertainty. This is achieved by:

  • Maintaining a posterior probability distribution for the expected reward of each variant (arm).
  • On each decision, sampling a single value from each variant's posterior.
  • Choosing the variant with the highest sampled value. This elegantly encodes uncertainty—a variant with a wide posterior (high uncertainty) has a higher chance of being sampled as the best, even if its current mean estimate is low.
02

Automatic Exploration-Exploitation Trade-off

The algorithm dynamically balances exploration (trying uncertain options) and exploitation (choosing the best-known option) without requiring an explicit tuning parameter like the epsilon in ε-greedy methods. The trade-off is inherent:

  • Exploitation: A variant with a high mean and low variance will consistently yield high samples.
  • Exploration: A variant with high variance (uncertainty) will occasionally produce a high sample, prompting a trial. As data accumulates, posterior distributions sharpen, and exploration naturally diminishes, converging to the optimal variant.
03

Contextual Bandit Extension

A powerful extension is the Contextual Thompson Sampling, which incorporates feature vectors (context) about each decision opportunity. For each arm, it models the reward as a function of the context, typically using a Bayesian linear model.

  • The algorithm maintains a posterior over the model's weight vectors.
  • For a given context, it samples a weight vector from the posterior for each arm, predicts a reward, and selects the arm with the highest sampled prediction. This allows for personalized decisions, making it highly effective in recommendation systems and dynamic treatment assignment.
04

Natural Handling of Delayed Feedback

Thompson Sampling integrates naturally with delayed or batched feedback, a common challenge in online systems. The Bayesian update mechanism is agnostic to the timing of data arrival.

  • Observations (rewards) can be incorporated into the posterior distributions as they arrive, regardless of delay.
  • The algorithm continues to make decisions based on the most recent posterior state. This contrasts with some frequentist bandit algorithms that may require complex adjustments for delayed rewards.
05

Computational Efficiency with Conjugate Priors

For common reward models, Thompson Sampling can be extremely computationally efficient through the use of conjugate priors.

  • Binary rewards (click/no-click): Use a Beta distribution prior, which updates to a Beta posterior with simple count updates.
  • Gaussian rewards: Use a Normal-Gamma or Normal-Inverse-Gamma prior for conjugate updates of the mean and variance. This allows for constant-time updates per observation, making the algorithm scalable for real-time systems with thousands of arms.
06

Asymptotic Optimality and Regret Bounds

Thompson Sampling provides strong theoretical performance guarantees. It is asymptotically optimal, meaning its cumulative regret grows at the best possible rate as the number of trials increases.

  • Cumulative Regret: The total loss incurred from not always playing the optimal arm. For Bernoulli bandits, Thompson Sampling achieves regret that is O(log T).
  • These bounds are derived from its probability-matching property, which ensures it does not under-explore the optimal arm. The algorithm's empirical performance often matches or exceeds that of Upper Confidence Bound (UCB) methods.
COMPARISON

Thompson Sampling vs. Other Bandit Algorithms

A feature and operational comparison of Thompson Sampling against other prominent multi-armed bandit algorithms, highlighting differences in statistical approach, exploration strategy, and practical implementation.

Feature / MetricThompson Sampling (Bayesian)Epsilon-GreedyUpper Confidence Bound (UCB)

Core Statistical Foundation

Bayesian inference; samples from posterior reward distributions

Frequentist; uses empirical average rewards

Frequentist; uses confidence bounds derived from concentration inequalities

Primary Exploration Mechanism

Probability matching via posterior sampling

Random exploration with fixed probability (ε)

Optimistic exploration of upper confidence bounds

Adapts Exploration Rate Automatically

Explicitly Models Reward Uncertainty

Handles Delayed or Batched Feedback

Computational Overhead per Decision

Medium (requires sampling from posteriors)

Low (requires maintaining averages)

Low (requires calculating confidence bounds)

Typical Regret Performance

Asymptotically optimal; often superior empirically

Suboptimal; linear regret if ε is constant

Asymptotically optimal for specific bound formulations

Requires Tuning of Hyperparameters

Prior parameters (often uninformative)

Exploration rate (ε)

Confidence bound parameter (α or c)

Natural Incorporation of Prior Knowledge

Common Use Case

Contextual bandits, recommendation systems, clinical trials

Simple A/B testing with basic exploration

Stochastic bandits with well-defined reward distributions

THOMPSON SAMPLING

Real-World Applications

Thompson Sampling's Bayesian approach to balancing exploration and exploitation makes it a powerful algorithm for optimizing decisions in dynamic, uncertain environments. Its applications span from digital interfaces to complex physical systems.

01

Website & UI Optimization

Thompson Sampling is a cornerstone of modern conversion rate optimization and personalization engines. It dynamically tests variations of web page elements (like headlines, button colors, or layouts) to maximize user engagement metrics.

  • Key Use Cases: Optimizing checkout flows, article recommendations, and ad placements.
  • Advantage over A/B Testing: Continuously reallocates traffic to better-performing variants, minimizing opportunity cost during experiments.
  • Real Example: Major e-commerce platforms use it to personalize homepages, showing different product assortments to maximize session value.
02

Clinical Trial Adaptive Design

In adaptive clinical trials, Thompson Sampling helps allocate patients to different treatment arms (e.g., drug dosage levels or novel therapies) to ethically maximize positive outcomes during the study.

  • Ethical Imperative: It inherently favors treatments showing early promise, potentially benefiting more participants.
  • Operational Efficiency: Can reduce the required sample size and trial duration compared to fixed randomization.
  • Consideration: Requires careful regulatory compliance and simulation to manage operational complexities.
03

Recommendation & Content Systems

Used to manage the explore-exploit trade-off in recommender systems. It selects which items (movies, products, news articles) to suggest from a large, changing catalog to maximize long-term user satisfaction.

  • Cold-Start Problem: Efficiently introduces new items with uncertain appeal into the recommendation pool.
  • Dynamic Catalog: Adapts to changing item popularity and user taste drift over time.
  • Architecture: Often implemented as a contextual bandit, where the algorithm uses user features (context) to personalize the sampling decision.
05

Financial Portfolio Optimization

Applied as a Bayesian portfolio strategy to dynamically allocate capital across a set of assets (the 'arms') with uncertain, non-stationary returns.

  • Sequential Decision-Making: Treats each rebalancing period as a pull of a bandit arm, where the reward is the asset's return.
  • Handles Uncertainty: Naturally incorporates prior beliefs about asset performance and updates them with observed market data.
  • Risk Management: The sampling mechanism inherently explores to avoid over-committing to assets based on potentially spurious short-term trends.
06

Industrial Process & Pricing Optimization

Optimizes parameters in complex systems where the reward function is noisy or expensive to evaluate.

  • Smart Manufacturing: Tuning machine settings (e.g., temperature, pressure) to maximize yield or quality, where each experiment is costly.
  • Dynamic Pricing: Testing different price points for products or services in real-time to maximize revenue, while learning price elasticity.
  • Key Benefit: Its sample-based action selection is robust to reward noise and non-Gaussian distributions common in industrial data.
THOMPSON SAMPLING

Frequently Asked Questions

Thompson sampling is a foundational Bayesian algorithm for solving the multi-armed bandit problem, balancing exploration and exploitation in online decision-making. These FAQs address its core mechanisms, applications, and practical considerations for CTOs and product managers implementing A/B testing frameworks.

Thompson sampling is a Bayesian algorithm for solving the multi-armed bandit problem that selects an action by sampling a potential reward value from the posterior probability distribution of each variant and choosing the variant with the highest sampled value. It works through a continuous loop: 1) For each variant (e.g., a webpage layout or model configuration), it maintains a posterior distribution over its expected reward rate (like click-through rate). 2) On each user visit, it draws one random sample from each variant's posterior. 3) It serves the variant with the highest sampled value. 4) It observes the binary outcome (click/no-click, conversion/no-conversion) and updates that variant's posterior distribution using Bayes' theorem, typically with a Beta-Binomial conjugate prior for binary rewards. This process inherently balances exploration (trying variants with uncertain, but potentially high, rewards) and exploitation (using the variant currently believed to be best).

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.