Glossary

Thompson Sampling

Thompson Sampling is a Bayesian algorithm for solving the exploration-exploitation dilemma in multi-armed bandit problems by sampling actions from posterior reward distributions.

Get in touch Learn more

Stylish WeWork-like workspace with hot desks and document wall, professional searching through enterprise knowledge base on a mounted ultrawide display, warm industrial pendants overhead.

RECURSIVE SELF-IMPROVEMENT

What is Thompson Sampling?

Thompson Sampling is a foundational algorithm for balancing exploration and exploitation in sequential decision-making problems, particularly the multi-armed bandit.

Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation dilemma in the multi-armed bandit problem. An agent selects an action by first sampling a potential reward value from the posterior probability distribution maintained for each available arm (or action) and then choosing the arm that yielded the highest sampled value. This stochastic mechanism naturally balances trying uncertain options (exploration) and favoring options believed to be best (exploitation).

The algorithm begins with a prior distribution (e.g., Beta for binary rewards, Gaussian for continuous) for each arm's unknown reward rate. After each action and observed reward, it performs a Bayesian update to refine the corresponding posterior. This approach is probabilistically optimal and widely applied in contextual bandits, recommendation systems, and hyperparameter optimization, where it efficiently navigates large search spaces. Its connection to recursive self-improvement lies in its use as a meta-algorithm for guiding an agent's own learning and adaptation processes.

MULTI-ARMED BANDIT ALGORITHM

Key Characteristics of Thompson Sampling

Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation trade-off by selecting actions based on samples drawn from posterior probability distributions.

Bayesian Probability Matching

The core mechanism of Thompson Sampling is probability matching. Instead of always choosing the arm with the highest estimated mean reward (pure exploitation), it selects each arm with a probability equal to that arm being the optimal one, given the current uncertainty. This is achieved by:

Maintaining a posterior distribution (e.g., Beta for Bernoulli rewards, Gaussian for normal rewards) for the expected reward of each arm.
On each trial, sampling a single value from each arm's posterior distribution.
Choosing the arm whose sampled value is the highest. This stochastic selection naturally balances exploration (sampling uncertain arms) and exploitation (sampling promising arms).

Automatic Uncertainty Quantification

Thompson Sampling inherently quantifies and responds to uncertainty through its Bayesian framework. The variance of the posterior distribution for each arm directly represents the algorithm's uncertainty about that arm's true reward rate.

High Variance (Early Stages): For an arm with few pulls, the posterior is wide. Random samples from this distribution can vary widely, giving it a significant chance of being selected over a known-good arm, promoting exploration.
Low Variance (Later Stages): As an arm is pulled more, its posterior distribution tightens around the empirical mean. Its sampled values become consistent, and it is only selected if its mean is truly high, leading to exploitation. This dynamic adjustment requires no external tuning of an exploration parameter (like ε in ε-greedy).

Asymptotic Optimality & Regret Bounds

Thompson Sampling provides strong theoretical performance guarantees. For the standard K-armed bandit problem, it achieves logarithmic expected cumulative regret under certain conditions, matching the asymptotic lower bound and making it asymptotically optimal.

Cumulative Regret: The total difference in reward between always pulling the optimal arm and the algorithm's choices. Thompson Sampling's regret grows as O(log T).
Frequentist Analysis: Modern analyses have proven these bounds without relying on a 'correct' Bayesian prior, showing robustness.
Contextual Bandits: Extensions to linear contextual bandits also achieve near-optimal regret bounds (O(√T) or O(log T) depending on assumptions), making it applicable to personalized recommendations and dynamic treatment assignment.

Computational & Implementation Simplicity

Despite its Bayesian foundation, Thompson Sampling is often computationally straightforward to implement and scale.

Conjugate Priors: Using conjugate prior-likelihood pairs (e.g., Beta-Bernoulli, Gaussian-Gaussian) allows the posterior to be updated with simple arithmetic upon observing a reward, requiring only O(1) computation per update.
Parallelization: Sampling and selection for each arm is independent, making it easy to parallelize across a large set of arms.
Comparison to UCB: Unlike Upper Confidence Bound (UCB) algorithms, which require calculating confidence intervals, Thompson Sampling only requires drawing random samples, which can be more efficient in complex models (e.g., deep neural network-based bandits where sampling from a posterior approximation is simpler than computing high-dimensional confidence sets).

Natural Extension to Complex Models

The sampling-based approach generalizes elegantly beyond simple bandits to complex, real-world problems.

Contextual Bandits with Deep Learning (Neural Thompson Sampling): The reward function is modeled by a deep neural network. A posterior over network weights is approximated (e.g., via bootstrapping, dropout as approximate Bayesian inference, or Laplace approximation). An action is chosen by sampling a network from this posterior and selecting the context-action pair with the highest predicted reward.
Non-Stationary Environments: Can adapt to changing reward distributions by using forgetting mechanisms like discounting old observations or using change-point models within the Bayesian update.
Hierarchical Models: Naturally incorporates shared structure between arms (e.g., in multi-task or meta-learning bandit settings) through hierarchical priors, allowing for efficient knowledge transfer.

Relationship to Other Bandit Algorithms

Thompson Sampling occupies a distinct point in the design space of bandit algorithms.

vs. ε-Greedy: ε-greedy explores randomly with a fixed probability ε, ignoring uncertainty. Thompson Sampling explores intelligently, proportionally to the probability an arm is optimal.
vs. Upper Confidence Bound (UCB): UCB is a deterministic, optimism-in-the-face-of-uncertainty principle. It selects the arm with the highest upper confidence bound. Thompson Sampling is randomized and directly samples from the posterior. The two are often viewed as dual perspectives (frequentist optimism vs. Bayesian probability matching) and can have similar regret bounds.
vs. Gittins Index: The Gittins Index provides an optimal solution for the discounted infinite-horizon Bayesian bandit problem but is often computationally intractable. Thompson Sampling is a simple, near-optimal heuristic for both discounted and finite-horizon settings.

EXPLORATION-EXPLOITATION TRADEOFF

Thompson Sampling vs. Other Bandit Algorithms

A comparison of heuristic strategies for solving the multi-armed bandit problem, focusing on their approach to balancing the exploration of uncertain options with the exploitation of known rewards.

Feature / Mechanism	Thompson Sampling (Bayesian Bandit)	ε-Greedy	Upper Confidence Bound (UCB)	Gradient Bandit Algorithm
Core Selection Principle	Sample from posterior, choose max sample	Choose random arm with probability ε, else choose best empirical mean	Choose arm maximizing empirical mean + confidence bound	Choose arm with probability from a softmax of preference scores
Underlying Framework	Bayesian probability (explicit posterior distributions)	Frequentist statistics (empirical averages)	Frequentist statistics with concentration inequalities	Stochastic gradient ascent on expected reward
Exploration Strategy	Probability matching via posterior sampling	Undirected, random exploration	Optimistic exploration of plausible best arms	Directed exploration via preference updates
Handles Delayed Feedback
Incorporates Prior Knowledge
Theoretical Regret Bounds	Asymptotically optimal for Bernoulli rewards	Linear regret if ε is fixed	Logarithmic problem-dependent regret	Sublinear regret with appropriate step-size
Computational Overhead	Medium (requires maintaining & sampling from posteriors)	Low (requires only mean estimates)	Low (requires mean estimates and visit counts)	Low (requires preference scores and a baseline)
Natural for Non-Stationary Environments

THOMPSON SAMPLING

Frequently Asked Questions

Thompson Sampling is a foundational algorithm for balancing exploration and exploitation, crucial for autonomous systems that must learn optimal actions through trial and error. These FAQs address its core mechanisms, applications, and relationship to broader concepts in agentic and recursive systems.

Thompson Sampling is a Bayesian heuristic for solving the exploration-exploitation dilemma in sequential decision-making problems, most classically the multi-armed bandit. It works by maintaining a posterior probability distribution over the expected reward for each possible action (or 'arm'). To select an action, the algorithm samples a single reward value from each action's posterior distribution and then executes the action with the highest sampled value. After observing the actual reward, it updates the posterior distribution for that action using Bayes' rule, refining its beliefs. This simple probability matching strategy naturally balances trying uncertain actions (exploration) and favoring actions believed to be best (exploitation).

Enabling Efficiency, Speed & Accuracy

Intelligent Analysis, Decision & Execution

We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.

Talk to Us

Search across company data

Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.

Useful when people spend too long searching or get different answers from different systems.

Enterprise searchRAGPermissions

Automate internal workflows

Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.

Useful when repetitive work moves across multiple tools and teams.

AI agentsWorkflow automationGovernance

Add AI to products and internal tools

Build assistants, guided actions, or decision support into the software your team or customers already use.

Useful when AI needs to be part of the product, not a separate tool.

AI integrationDecision supportModel routing

EXPLORATION VS. EXPLOITATION

Related Terms

Thompson Sampling is a core technique for balancing exploration and exploitation in sequential decision-making. The following terms are foundational to understanding its context, alternatives, and extensions.

Multi-Armed Bandit Problem

The foundational sequential decision-making framework where an agent repeatedly chooses from a set of actions (arms), each providing a stochastic reward from an unknown distribution. The core tension is between exploitation (choosing the arm with the highest estimated reward) and exploration (trying other arms to gather more information). Thompson Sampling is a heuristic solution to this problem.

Bayesian Optimization

A global optimization strategy for expensive black-box functions. Like Thompson Sampling, it uses a probabilistic surrogate model (e.g., a Gaussian Process) to model the objective. It selects the next point to evaluate by optimizing an acquisition function, which explicitly balances exploration and exploitation. Common functions include Expected Improvement (EI) and Upper Confidence Bound (UCB).

Upper Confidence Bound (UCB)

A deterministic alternative to Thompson Sampling for the multi-armed bandit problem. The algorithm selects the arm with the highest upper confidence bound, calculated as the estimated mean reward plus an exploration bonus. The bonus (e.g., √(2 log t / n_i)) shrinks as an arm is pulled more often (n_i), ensuring all arms are sufficiently explored. It provides strong theoretical regret bounds.

Contextual Bandits

An extension of the classic bandit problem where, before each decision, the agent observes a context vector (e.g., user features, time of day). The goal is to learn a policy mapping contexts to actions to maximize reward. Thompson Sampling is readily extended to this setting using contextual models like Bayesian linear regression or neural networks with approximate posterior distributions.

Reinforcement Learning (RL)

A broader framework for sequential decision-making in environments with delayed rewards and state transitions. The multi-armed bandit is a stateless, single-step RL problem. Thompson Sampling's principle of posterior sampling extends to RL as Posterior Sampling for Reinforcement Learning (PSRL), where the agent samples a full Markov Decision Process (MDP) from a posterior and acts optimally within it for an episode.

Monte Carlo Methods

A class of computational algorithms that rely on repeated random sampling to obtain numerical results. Thompson Sampling is inherently a Monte Carlo method: it makes decisions by sampling from posterior distributions. This connects it to broader techniques like Monte Carlo Tree Search (MCTS), which uses random simulations to evaluate actions in complex decision trees.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.

Limited slotsGet a Free AI Consultation

How We Work

Custom AI workflows for your Business

One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.