Inferensys

Glossary

Thompson Sampling

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation dilemma in multi-armed bandit problems by sampling actions from posterior reward distributions.
Stylish WeWork-like workspace with hot desks and document wall, professional searching through enterprise knowledge base on a mounted ultrawide display, warm industrial pendants overhead.
RECURSIVE SELF-IMPROVEMENT

What is Thompson Sampling?

Thompson Sampling is a foundational algorithm for balancing exploration and exploitation in sequential decision-making problems, particularly the multi-armed bandit.

Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation dilemma in the multi-armed bandit problem. An agent selects an action by first sampling a potential reward value from the posterior probability distribution maintained for each available arm (or action) and then choosing the arm that yielded the highest sampled value. This stochastic mechanism naturally balances trying uncertain options (exploration) and favoring options believed to be best (exploitation).

The algorithm begins with a prior distribution (e.g., Beta for binary rewards, Gaussian for continuous) for each arm's unknown reward rate. After each action and observed reward, it performs a Bayesian update to refine the corresponding posterior. This approach is probabilistically optimal and widely applied in contextual bandits, recommendation systems, and hyperparameter optimization, where it efficiently navigates large search spaces. Its connection to recursive self-improvement lies in its use as a meta-algorithm for guiding an agent's own learning and adaptation processes.

MULTI-ARMED BANDIT ALGORITHM

Key Characteristics of Thompson Sampling

Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation trade-off by selecting actions based on samples drawn from posterior probability distributions.

01

Bayesian Probability Matching

The core mechanism of Thompson Sampling is probability matching. Instead of always choosing the arm with the highest estimated mean reward (pure exploitation), it selects each arm with a probability equal to that arm being the optimal one, given the current uncertainty. This is achieved by:

  • Maintaining a posterior distribution (e.g., Beta for Bernoulli rewards, Gaussian for normal rewards) for the expected reward of each arm.
  • On each trial, sampling a single value from each arm's posterior distribution.
  • Choosing the arm whose sampled value is the highest. This stochastic selection naturally balances exploration (sampling uncertain arms) and exploitation (sampling promising arms).
02

Automatic Uncertainty Quantification

Thompson Sampling inherently quantifies and responds to uncertainty through its Bayesian framework. The variance of the posterior distribution for each arm directly represents the algorithm's uncertainty about that arm's true reward rate.

  • High Variance (Early Stages): For an arm with few pulls, the posterior is wide. Random samples from this distribution can vary widely, giving it a significant chance of being selected over a known-good arm, promoting exploration.
  • Low Variance (Later Stages): As an arm is pulled more, its posterior distribution tightens around the empirical mean. Its sampled values become consistent, and it is only selected if its mean is truly high, leading to exploitation. This dynamic adjustment requires no external tuning of an exploration parameter (like ε in ε-greedy).
03

Asymptotic Optimality & Regret Bounds

Thompson Sampling provides strong theoretical performance guarantees. For the standard K-armed bandit problem, it achieves logarithmic expected cumulative regret under certain conditions, matching the asymptotic lower bound and making it asymptotically optimal.

  • Cumulative Regret: The total difference in reward between always pulling the optimal arm and the algorithm's choices. Thompson Sampling's regret grows as O(log T).
  • Frequentist Analysis: Modern analyses have proven these bounds without relying on a 'correct' Bayesian prior, showing robustness.
  • Contextual Bandits: Extensions to linear contextual bandits also achieve near-optimal regret bounds (O(√T) or O(log T) depending on assumptions), making it applicable to personalized recommendations and dynamic treatment assignment.
04

Computational & Implementation Simplicity

Despite its Bayesian foundation, Thompson Sampling is often computationally straightforward to implement and scale.

  • Conjugate Priors: Using conjugate prior-likelihood pairs (e.g., Beta-Bernoulli, Gaussian-Gaussian) allows the posterior to be updated with simple arithmetic upon observing a reward, requiring only O(1) computation per update.
  • Parallelization: Sampling and selection for each arm is independent, making it easy to parallelize across a large set of arms.
  • Comparison to UCB: Unlike Upper Confidence Bound (UCB) algorithms, which require calculating confidence intervals, Thompson Sampling only requires drawing random samples, which can be more efficient in complex models (e.g., deep neural network-based bandits where sampling from a posterior approximation is simpler than computing high-dimensional confidence sets).
05

Natural Extension to Complex Models

The sampling-based approach generalizes elegantly beyond simple bandits to complex, real-world problems.

  • Contextual Bandits with Deep Learning (Neural Thompson Sampling): The reward function is modeled by a deep neural network. A posterior over network weights is approximated (e.g., via bootstrapping, dropout as approximate Bayesian inference, or Laplace approximation). An action is chosen by sampling a network from this posterior and selecting the context-action pair with the highest predicted reward.
  • Non-Stationary Environments: Can adapt to changing reward distributions by using forgetting mechanisms like discounting old observations or using change-point models within the Bayesian update.
  • Hierarchical Models: Naturally incorporates shared structure between arms (e.g., in multi-task or meta-learning bandit settings) through hierarchical priors, allowing for efficient knowledge transfer.
06

Relationship to Other Bandit Algorithms

Thompson Sampling occupies a distinct point in the design space of bandit algorithms.

  • vs. ε-Greedy: ε-greedy explores randomly with a fixed probability ε, ignoring uncertainty. Thompson Sampling explores intelligently, proportionally to the probability an arm is optimal.
  • vs. Upper Confidence Bound (UCB): UCB is a deterministic, optimism-in-the-face-of-uncertainty principle. It selects the arm with the highest upper confidence bound. Thompson Sampling is randomized and directly samples from the posterior. The two are often viewed as dual perspectives (frequentist optimism vs. Bayesian probability matching) and can have similar regret bounds.
  • vs. Gittins Index: The Gittins Index provides an optimal solution for the discounted infinite-horizon Bayesian bandit problem but is often computationally intractable. Thompson Sampling is a simple, near-optimal heuristic for both discounted and finite-horizon settings.
EXPLORATION-EXPLOITATION TRADEOFF

Thompson Sampling vs. Other Bandit Algorithms

A comparison of heuristic strategies for solving the multi-armed bandit problem, focusing on their approach to balancing the exploration of uncertain options with the exploitation of known rewards.

Feature / MechanismThompson Sampling (Bayesian Bandit)ε-GreedyUpper Confidence Bound (UCB)Gradient Bandit Algorithm

Core Selection Principle

Sample from posterior, choose max sample

Choose random arm with probability ε, else choose best empirical mean

Choose arm maximizing empirical mean + confidence bound

Choose arm with probability from a softmax of preference scores

Underlying Framework

Bayesian probability (explicit posterior distributions)

Frequentist statistics (empirical averages)

Frequentist statistics with concentration inequalities

Stochastic gradient ascent on expected reward

Exploration Strategy

Probability matching via posterior sampling

Undirected, random exploration

Optimistic exploration of plausible best arms

Directed exploration via preference updates

Handles Delayed Feedback

Incorporates Prior Knowledge

Theoretical Regret Bounds

Asymptotically optimal for Bernoulli rewards

Linear regret if ε is fixed

Logarithmic problem-dependent regret

Sublinear regret with appropriate step-size

Computational Overhead

Medium (requires maintaining & sampling from posteriors)

Low (requires only mean estimates)

Low (requires mean estimates and visit counts)

Low (requires preference scores and a baseline)

Natural for Non-Stationary Environments

THOMPSON SAMPLING

Frequently Asked Questions

Thompson Sampling is a foundational algorithm for balancing exploration and exploitation, crucial for autonomous systems that must learn optimal actions through trial and error. These FAQs address its core mechanisms, applications, and relationship to broader concepts in agentic and recursive systems.

Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation dilemma in sequential decision-making problems, most classically the multi-armed bandit. It works by maintaining a posterior probability distribution over the expected reward for each possible action (or 'arm'). To select an action, the algorithm samples a single reward value from each action's posterior distribution and then executes the action with the highest sampled value. After observing the actual reward, it updates the posterior distribution for that action using Bayes' rule, refining its beliefs. This simple probability matching strategy naturally balances trying uncertain actions (exploration) and favoring actions believed to be best (exploitation).

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.