Thompson Sampling is a Bayesian heuristic for the exploration-exploitation dilemma in the multi-armed bandit problem. At each round, the agent samples a plausible reward value from the posterior distribution it maintains for each arm (action), then pulls the arm whose sample is highest. This stochastic mechanism naturally balances trying uncertain options (exploration) and favoring options currently believed to be best (exploitation): an arm with a wide posterior occasionally produces a large sample and gets chosen, while an arm with a confidently high posterior is chosen often.
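As a minimal sketch, consider a Bernoulli bandit where each arm pays 1 with some unknown probability. With a Beta prior, the posterior after observing successes and failures is again a Beta distribution, so the sample-and-pick-the-max step is a one-liner. The arm count, true success rates, and round budget below are illustrative assumptions, not part of any standard:

```python
import random

def thompson_step(successes, failures):
    """Sample one value from each arm's Beta posterior and return
    the index of the arm with the highest sample."""
    samples = [random.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# Simulate a 3-armed Bernoulli bandit with hypothetical true rates.
random.seed(0)
true_rates = [0.3, 0.5, 0.7]
successes = [0, 0, 0]
failures = [0, 0, 0]

for _ in range(2000):
    arm = thompson_step(successes, failures)
    # Draw a Bernoulli reward and update that arm's posterior counts.
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

pulls = [s + f for s, f in zip(successes, failures)]
```

Early on, all posteriors are wide and every arm is sampled; as evidence accumulates, the posteriors of inferior arms concentrate below the best arm's, and pulls shift toward it without any explicit exploration schedule.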
