Selection is the initial phase of a Monte Carlo Tree Search (MCTS) iteration where the algorithm traverses the existing search tree from the root node (current state) to a leaf node by recursively choosing child nodes according to a tree policy. The primary objective is to navigate the tree to a point where further expansion and simulation are beneficial, balancing the exploration-exploitation tradeoff to efficiently allocate computational resources. This phase is guided by node statistics, primarily visit count and cumulative reward, which are updated during backpropagation.
Glossary
Selection (MCTS Phase)

What is Selection (MCTS Phase)?
The first and defining phase of the Monte Carlo Tree Search algorithm where the existing search tree is traversed.
The canonical tree policy for selection is the Upper Confidence Bound for Trees (UCT) formula, which mathematically balances exploiting nodes with high average reward and exploring nodes with high uncertainty (few visits). Other policies include Progressive Widening for large action spaces. The selection phase concludes when the traversal reaches a node that is either not fully expanded (a leaf) or terminal, setting the stage for the subsequent expansion and simulation phases to gather new information from that point in the state space.
Key Components of the Selection Phase
The Selection phase is the initial traversal step in a Monte Carlo Tree Search iteration, where the algorithm navigates from the root to a leaf node using a tree policy to balance exploration and exploitation.
Tree Policy
The tree policy is the core decision function used during the Selection phase to choose which child node to traverse at each step. It defines the algorithm's strategy for navigating the existing search tree.
- The canonical policy is the Upper Confidence Bound for Trees (UCT), which mathematically balances the exploration-exploitation tradeoff.
- Other policies include UCB1, Thompson sampling, or domain-specific heuristics for problems like continuous action spaces.
- The policy's output determines the path from the root to a leaf node that has not been fully expanded.
Upper Confidence Bound for Trees (UCT)
Upper Confidence Bound for Trees (UCT) is the standard formula used as the tree policy in MCTS. It selects the child node j of parent node i that maximizes:
UCT(i, j) = Q_j / N_j + c * sqrt( ln(N_i) / N_j )
- Q_j / N_j: The exploitation term, representing the average reward (value) of child j.
- c * sqrt( ln(N_i) / N_j ): The exploration term, which encourages sampling less-visited nodes. The constant c controls the exploration weight.
- This formula ensures that promising nodes are revisited (exploitation) while under-sampled nodes are eventually tried (exploration), guaranteeing asymptotic convergence to the optimal action.
Node Statistics (Visit Count & Value)
Each node in the MCTS tree maintains key statistics that are read and updated during Selection and Backpropagation.
- Visit Count (N): The number of times the node has been traversed during the Selection phase. This is the denominator in the UCT formula and is crucial for guiding exploration.
- Cumulative Value (Q or W): The sum of all simulation results (rewards) that have been backpropagated through this node. The average value
Q/Nrepresents the node's estimated utility. - These statistics are stored in memory and are continuously refined as the search progresses, forming the algorithm's "knowledge" about the state space.
Traversal to a Leaf Node
The Selection phase's objective is to traverse from the root node (current state) to a leaf node. A leaf node is defined as either:
- A terminal node (a game-ending state).
- A non-terminal node that has not yet been expanded (i.e., not all its possible child actions have been added to the tree).
- The traversal is recursive: starting at the root, the tree policy selects one child, moves to it, and repeats until a leaf is reached.
- This process focuses computational effort on the most promising regions of the vast state space, as informed by the updated node statistics.
Exploration vs. Exploitation
The exploration-exploitation tradeoff is the fundamental dilemma managed by the Selection phase. The tree policy must decide between:
-
Exploitation: Choosing the child node with the highest average value (
Q/N). This refines the estimate of the currently best-known path. -
Exploration: Choosing a less-visited child node (low
N). This gathers new information, which may reveal a better path than the current best estimate. -
The UCT formula provides a principled, optimal balance. Early in search, exploration dominates. As visit counts grow, exploitation of high-value nodes becomes more prominent.
Handling Large Action Spaces
For environments with vast or continuous action spaces, naive Selection over all possible children is infeasible. Advanced techniques modify the phase:
- Progressive Widening: The number of child nodes considered for a parent is not fixed. It starts small and increases gradually as the parent's visit count
N_igrows (e.g.,K * sqrt(N_i)). New actions are dynamically added during Selection/Expansion. - Double Progressive Widening: Extends this concept to both action widening (adding new actions) and state widening (adding new outcome states for stochastic environments).
- These methods ensure the tree grows in a tractable, focused manner, prioritizing depth and evaluation of promising actions over breadth.
Frequently Asked Questions
The Selection phase is the critical first step in each Monte Carlo Tree Search (MCTS) iteration, where the algorithm navigates the existing tree to choose a promising leaf node for further exploration. These questions address its core mechanics, policies, and engineering considerations.
The Selection phase is the initial step in a Monte Carlo Tree Search (MCTS) iteration where the algorithm traverses from the root node (current state) to a leaf node by recursively selecting child nodes according to a tree policy, most commonly the Upper Confidence Bound for Trees (UCT) formula. This phase balances exploration of less-visited paths with exploitation of paths known to yield high rewards, efficiently guiding the search toward the most promising regions of the decision tree without exhaustive expansion. Its output is a leaf node where the next phase, Expansion, will occur.
Enabling Efficiency, Speed & Accuracy
Intelligent Analysis, Decision & Execution
We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.
Talk to Us
Search across company data
Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.
Useful when people spend too long searching or get different answers from different systems.

Automate internal workflows
Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.
Useful when repetitive work moves across multiple tools and teams.

Add AI to products and internal tools
Build assistants, guided actions, or decision support into the software your team or customers already use.
Useful when AI needs to be part of the product, not a separate tool.
Related Terms
The Selection phase is the first step in a Monte Carlo Tree Search iteration. It navigates the existing tree from the root to a leaf node by applying a tree policy, most commonly the Upper Confidence Bound for Trees (UCT) formula, to balance exploration and exploitation.
Upper Confidence Bound for Trees (UCT)
Upper Confidence Bound for Trees (UCT) is the canonical tree policy used during the Selection phase. It formalizes the exploration-exploitation tradeoff by selecting the child node that maximizes a score combining the node's average reward (exploitation) and an exploration bonus inversely proportional to its visit count. The formula is: UCT = Q/N + c * sqrt(ln(ParentVisits) / N), where Q is the total reward, N is the node's visit count, and c is an exploration constant.
Exploration-Exploitation Tradeoff
The exploration-exploitation tradeoff is the core dilemma addressed by the Selection policy. Exploitation means choosing the child node with the highest known average reward. Exploration means sampling less-visited nodes to gather more information and avoid local optima. An effective Selection phase, like one using UCT, mathematically balances these competing goals to asymptotically converge on the optimal action.
Visit Count
Visit count is a critical statistic stored in each node and updated during Backpropagation. During Selection, it serves two key functions:
- Informs Exploration: In the UCT formula, the visit count (
N) is in the denominator of the exploration term. A lowNincreases the exploration bonus, making an under-sampled node more attractive. - Guards Against Overfitting: High visit counts indicate a well-sampled node, giving confidence in its average reward estimate. The final action chosen from the root is typically the child with the highest visit count, not the highest immediate reward.
Tree Policy
A tree policy is the algorithm that decides how to traverse the tree during the Selection phase. While UCT is standard, other policies exist for different problem domains:
- Greedy Policy: Pure exploitation, always selects the child with the highest average reward. Prone to getting stuck in suboptimal paths.
- Epsilon-Greedy: Selects the greedy action most of the time, but with probability
epsilonchooses a random child. A simpler alternative to UCT. - Probability Matching: Selects children with probability proportional to their estimated value. The policy defines the intelligence of the tree traversal.
Progressive Widening
Progressive widening is a technique used to handle large or continuous action spaces within the Selection/Expansion framework. Instead of expanding all possible actions from a node at once, it gradually increases the number of considered child nodes as the parent's visit count grows. This keeps the tree manageable and focuses computational budget on promising areas of the action space first. It is often paired with a bandit algorithm at each node to select which new action to add next.
Virtual Loss
Virtual loss is a parallelization technique that modifies the Selection phase for multi-threaded MCTS. When a thread selects a node, it temporarily adds a "virtual loss" to that node's reward statistics. This artificially lowers its UCT score for the duration of its simulation, encouraging other threads to explore different parts of the tree and reducing wasteful contention. The virtual loss is removed once the thread completes its simulation and performs Backpropagation with the real result.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
Partnered with leading AI, data, and software stack.
How We Work
Custom AI workflows for your Business
One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.
01
Review the use case
We understand the task, the users, and where AI can actually help.
Read more02
Pick the right approach
We define what needs search, automation, or product integration.
Read more03
Build the first useful version
We implement the part that proves the value first.
Read more04
Improve from there
We add the checks and visibility needed to keep it useful.
Read moreThe first call is a practical review of your use case and the right next step.
Talk to Us