Simulation (rollout) is the phase in Monte Carlo Tree Search where, starting from a newly expanded node, a fast playout policy (often random or heuristic) is used to play the game or model the process until a terminal state is reached, generating a final outcome or reward. This computationally cheap playout provides a stochastic estimate of the long-term value of the node without a full, expensive search.
Glossary
Simulation (Rollout)

What is Simulation (Rollout)?
The simulation, or rollout, is a core phase of the Monte Carlo Tree Search (MCTS) algorithm responsible for estimating the value of a newly expanded game state.
The result of this simulated trajectory is then backpropagated up the tree to update the statistics of all ancestor nodes. While simple random playouts are common, more informed policies can drastically improve sample efficiency. This phase is distinct from the selection and expansion phases, focusing purely on outcome estimation rather than tree traversal or growth.
Key Characteristics of the Simulation Phase
The simulation (or rollout) phase is where Monte Carlo Tree Search estimates the value of a newly expanded node by playing out the scenario to a terminal state using a fast, often random, policy.
Objective: Value Estimation
The core purpose is to generate a sample outcome (win/loss, reward) from the expanded node's state. This stochastic estimate, averaged over many rollouts, approximates the true expected value of that position. It provides the data point for the subsequent backpropagation phase.
- Key Metric: Each rollout is an unbiased sample of the expected outcome under the playout policy (not under optimal play), but it can have high variance, especially with random playouts.
- Analogy: Like running a fast, approximate simulation of a business strategy to gauge its potential success, rather than a full, costly analysis.
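The averaging described above can be sketched on a toy game. This is a minimal illustration, not a reference implementation: the game (a Nim variant where players remove 1 to 3 stones and whoever takes the last stone wins) and the names `legal_moves`, `random_rollout`, and `estimate_value` are assumptions introduced for the example.

```python
import random

def legal_moves(stones):
    """Toy Nim (assumed for illustration): remove 1, 2, or 3 stones."""
    return [m for m in (1, 2, 3) if m <= stones]

def random_rollout(stones, rng):
    """Play uniformly random moves until the last stone is taken.

    Requires stones >= 1. Returns +1 if the player to move at the
    starting state takes the last stone, otherwise -1.
    """
    player = 0  # 0 = the player to move at the starting state
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return 1 if player == 0 else -1
        player = 1 - player

def estimate_value(stones, n_rollouts, seed=0):
    """Average many rollout outcomes into a Monte Carlo value estimate."""
    rng = random.Random(seed)
    total = sum(random_rollout(stones, rng) for _ in range(n_rollouts))
    return total / n_rollouts
```

Note that the average converges to the value of the position under random play by both sides, which is why the choice of playout policy matters.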
The Playout Policy
This is the algorithm that selects actions during the rollout. Its speed is critical, as thousands of simulations are run.
- Default Policy: Often a uniform random selection from legal actions. It's fast and requires no domain knowledge.
- Heuristic Policy: A simple, rule-based strategy (e.g., capture if possible in chess) that yields more informative outcomes than pure random, reducing variance.
- Learned Policy: A learned model can also drive the playout. The original AlphaGo used a small, fast rollout policy alongside its value network. AlphaZero goes further and drops rollouts entirely, evaluating leaf nodes with a value network instead.
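One way to keep the playout policy swappable is to treat it as a plain function passed into the rollout. The sketch below assumes the same toy Nim game (remove 1 to 3 stones, last stone wins); `uniform_policy`, `heuristic_policy`, and `rollout` are hypothetical names, and the heuristic plays the role of a "capture if possible" rule.

```python
import random

def legal_moves(stones):
    """Toy Nim (assumed for illustration): remove 1, 2, or 3 stones."""
    return [m for m in (1, 2, 3) if m <= stones]

def uniform_policy(stones, rng):
    """Default policy: pick uniformly among legal moves."""
    return rng.choice(legal_moves(stones))

def heuristic_policy(stones, rng):
    """Rule-based policy: take every stone when that wins immediately,
    otherwise fall back to a uniform random move."""
    if stones <= 3:
        return stones
    return rng.choice(legal_moves(stones))

def rollout(stones, policy, rng):
    """Play `policy` for both sides until the last stone is taken.

    Requires stones >= 1. Returns +1 if the player to move at the
    start takes the last stone, otherwise -1.
    """
    player = 0
    while True:
        stones -= policy(stones, rng)
        if stones == 0:
            return 1 if player == 0 else -1
        player = 1 - player
```

From a four-stone position, `heuristic_policy` always yields -1, because every move leaves the opponent an immediate win; `uniform_policy` gives a mix of +1 and -1 from the same position, which is the extra variance the heuristic removes.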
Terminal State & Reward
The simulation continues until a terminal state is reached, where the game or process concludes. The outcome is then scored.
- Deterministic Terminal: A clear win, loss, or draw condition (e.g., checkmate in chess).
- Stochastic Terminal: An outcome with a probabilistic score (e.g., final profit in a simulated supply chain).
- Reward Signal: The result is converted into a numerical reward (e.g., +1 for win, 0 for draw, -1 for loss). This scalar is what is propagated back through the tree.
Computational Trade-Off: Speed vs. Fidelity
The phase embodies a key engineering trade-off. Faster, simpler policies allow more simulations within a fixed compute budget, and by the law of large numbers the averaged estimate tightens as the number of samples grows. However, each individual sample from a simple policy may be noisy.
- High Variance: Random playouts in complex games can produce misleading outcomes, requiring many more iterations for a reliable average.
- Bias-Variance Trade-off: Introducing a smart heuristic reduces variance but can introduce bias if the heuristic is suboptimal. MCTS does not prune explicitly, but the search may starve truly good lines of visits when the heuristic misjudges them.
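The trade-off can be made concrete with a standard decomposition. As a sketch, assume the n rollouts from a node are independent samples r_1, ..., r_n from a fixed playout policy π with mean V^π and variance σ_π², and let V* be the value the search is trying to estimate. These symbols are introduced here for illustration only.

```latex
\hat{V}_n = \frac{1}{n}\sum_{i=1}^{n} r_i,
\qquad
\mathbb{E}\!\left[(\hat{V}_n - V^{*})^2\right]
  = \underbrace{(V^{\pi} - V^{*})^2}_{\text{bias}^2}
  + \underbrace{\frac{\sigma_{\pi}^{2}}{n}}_{\text{variance}}
```

More rollouts shrink only the second term, and the standard error falls as 1/√n, so halving it takes roughly four times as many rollouts. A heuristic policy typically lowers σ_π² but also changes V^π, which moves the first term, and no number of additional rollouts from that node removes it.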
Parallelism & Virtual Loss
Simulations are embarrassingly parallel. Multiple rollouts from different leaf nodes can be run simultaneously. To efficiently parallelize a shared tree, the virtual loss technique is used.
- Mechanism: When a thread selects a node for a rollout, it temporarily adds a virtual loss to that node's statistics. This artificially lowers its estimated value, discouraging other threads from selecting the same path immediately.
- Benefit: It reduces thread contention, effectively diversifying exploration across the tree during parallel execution.
Enhancements: RAVE & Domain Knowledge
Basic random rollouts can be improved with techniques that share information across the tree.
- Rapid Action Value Estimation (RAVE): Credits an action with the outcome of every simulation through a node in which that action was played at any later point, not only as the immediate next move. This accelerates the convergence of value estimates, especially in Go.
- Domain-Specific Termination: In non-game applications (e.g., logistics planning), a simulation may terminate after a fixed horizon or when a key milestone is reached, with a heuristic evaluation of the intermediate state.
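A fixed-horizon rollout with a heuristic evaluation can be sketched as follows. The function name `truncated_rollout` and its callback parameters (`step`, `is_terminal`, `evaluate`, `policy`) are assumptions for illustration, and the toy process underneath (an integer counter that finishes at 10) stands in for a real planning model.

```python
import random

def truncated_rollout(state, step, is_terminal, evaluate, policy, horizon, rng):
    """Roll forward for at most `horizon` steps, then score the state reached.

    Stops early if a terminal state is hit. `evaluate` must be able to
    score both terminal and intermediate states.
    """
    for _ in range(horizon):
        if is_terminal(state):
            break
        state = step(state, policy(state, rng))
    return evaluate(state)

# Toy process (hypothetical): a counter that is "done" once it reaches 10.
def step(state, action):
    return state + action

def is_terminal(state):
    return state >= 10

def evaluate(state):
    return min(state, 10) / 10  # fraction of the goal achieved, in [0, 1]

def always_advance(state, rng):
    return 1  # deterministic policy, used here to keep the example predictable

score = truncated_rollout(0, step, is_terminal, evaluate,
                          always_advance, horizon=3, rng=random.Random(0))
```

With a horizon of 3 the rollout stops at state 3 and the heuristic scores it 0.3. The quality of such estimates depends on how well `evaluate` tracks the eventual outcome.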
How the Simulation (Rollout) Phase Works
The simulation phase, also called a rollout, is the third step in a Monte Carlo Tree Search (MCTS) iteration where a fast policy is used to play out the scenario from a newly expanded node to a terminal state.
In the simulation (rollout) phase, the algorithm uses a playout policy (typically something fast and lightweight, such as uniform random selection among legal actions) to sample a possible future from the newly expanded node until reaching a terminal state (e.g., game end). This default policy generates a final outcome, such as a win/loss or a scalar reward, without the computational cost of further tree expansion. The result provides a stochastic estimate of the node's value.
The rollout is a Monte Carlo sampling technique that estimates the long-term value of a state through repeated random forward play. While simple, this approach allows MCTS to evaluate positions in vast state spaces where exhaustive search is impossible. The efficiency of the rollout policy is critical, as it is executed thousands of times per search. More informed policies can reduce variance and improve convergence speed.
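To show where the rollout sits in a full iteration, here is a compact sketch of the four phases on a toy game. The game (Nim: remove 1 to 3 stones, last stone wins), the exploration constant, and every name below are assumptions for illustration rather than a reference implementation.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left for the player to move here
        self.parent = parent
        self.move = move          # move that led from parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.total = 0.0          # reward sum, seen by the player who moved INTO this node

def select(node, c=1.4):
    """Selection: descend by UCT while the node is fully expanded."""
    while not node.untried and node.children:
        parent = node
        node = max(parent.children, key=lambda ch: ch.total / ch.visits
                   + c * math.sqrt(math.log(parent.visits) / ch.visits))
    return node

def expand(node):
    """Expansion: add one untried child (terminal nodes are returned as-is)."""
    if not node.untried:
        return node
    move = node.untried.pop()
    child = Node(node.stones - move, parent=node, move=move)
    node.children.append(child)
    return child

def simulate(stones, rng):
    """Simulation: random playout. +1 if the player to move wins, else -1."""
    if stones == 0:
        return -1  # the previous player took the last stone
    player = 0
    while True:
        stones -= rng.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return 1 if player == 0 else -1
        player = 1 - player

def backpropagate(node, reward):
    """Backpropagation. `reward` is from the view of the player to move at `node`."""
    while node is not None:
        node.visits += 1
        node.total -= reward  # store it from the view of the player who moved in
        reward = -reward      # switch perspective one level up
        node = node.parent

def best_move(stones, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        leaf = expand(select(root))
        backpropagate(leaf, simulate(leaf.stones, rng))
    return max(root.children, key=lambda ch: ch.visits).move
```

Only `simulate` is the rollout phase. Swapping its random move choice for a heuristic or learned policy leaves the other three phases untouched.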
Frequently Asked Questions
A simulation, also called a rollout or playout, is the core estimation phase within the Monte Carlo Tree Search (MCTS) algorithm. This section answers key technical questions about its function, design, and optimization.
A simulation (rollout) is the phase in the Monte Carlo Tree Search (MCTS) algorithm where, starting from a newly expanded leaf node, a fast playout policy is used to sample a sequence of actions until a terminal state (e.g., game end) is reached, generating a final outcome or reward used to estimate the node's value.
This process is a form of Monte Carlo estimation, using random sampling to approximate the expected utility of a state when an exact calculation is computationally infeasible. The result—often a win/loss/draw signal or a numerical score—is then backpropagated up the tree to update the statistics of all ancestor nodes, refining the algorithm's understanding of which paths are promising.
Related Terms
Simulation (Rollout) is one core phase of the Monte Carlo Tree Search loop. These cards detail the other phases, key policies, and advanced algorithmic enhancements that interact with the rollout process.
Playout Policy
The playout policy is the specific strategy used to conduct the simulation from a leaf node to a terminal state. Its primary goal is speed, not optimality.
- Default Policy: Often a fast, random uniform selection of actions.
- Heuristic Policy: Uses a simple, domain-specific rule to bias actions for more informative outcomes (e.g., avoiding obviously bad moves in Go).
- Lightweight Network: A small, fast learned policy can guide playouts, as in the original AlphaGo. AlphaZero, by contrast, does not run playouts and evaluates leaf nodes with a value network.
The choice of policy directly impacts the variance and bias of the value estimates propagated back through the tree.
Upper Confidence Bound for Trees (UCT)
Upper Confidence Bound for Trees (UCT) is the canonical formula used during the Selection phase to navigate the tree. It mathematically balances the exploration-exploitation tradeoff.
The formula for a child node i is: UCT(i) = Q(i)/N(i) + c * sqrt(ln(N(parent)) / N(i))
- Q(i)/N(i): The average reward (exploitation term).
- sqrt(...): The exploration bonus, favoring less-visited nodes.
- c: A tunable constant controlling exploration weight.
UCT ensures the simulation budget is allocated efficiently, guiding rollouts toward promising but underexplored regions of the state space.
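The formula translates directly into a scoring function. In this sketch the function name, the default c = √2, and the convention of returning infinity for unvisited children (so each child is tried once before the formula applies) are assumptions, although all three are common choices.

```python
import math

def uct_score(q, n, parent_n, c=math.sqrt(2)):
    """UCT(i) = Q(i)/N(i) + c * sqrt(ln(N(parent)) / N(i)).

    q: cumulative reward of the child, n: child visit count,
    parent_n: parent visit count, c: exploration constant.
    """
    if n == 0:
        return math.inf  # assumed convention: visit every child at least once
    return q / n + c * math.sqrt(math.log(parent_n) / n)

# Two children with the same average reward (0.5) under a parent visited 100 times:
well_visited = uct_score(q=5.0, n=10, parent_n=100)
rarely_visited = uct_score(q=0.5, n=1, parent_n=100)
```

Here `rarely_visited` scores higher than `well_visited` because its exploration bonus is larger, which is the behavior the second term is meant to produce.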
Backpropagation (MCTS Phase)
Backpropagation is the final phase where the result of the simulation (rollout) is propagated back up the tree to update node statistics.
- Updated Statistics: The visit count (N) and cumulative reward (Q) of every node along the path from the expanded leaf back to the root are incremented.
- Value Averaging: The node's estimated value becomes Q/N, the mean reward from all simulations passing through it.
- Information Flow: This phase is what makes MCTS an asymmetric search, progressively focusing the tree on the most valuable lines of play based on rollout outcomes.
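A minimal sketch of the update walks parent pointers from the expanded leaf to the root. The `Node` class and the `two_player` flag are assumptions for illustration. The flag reflects a common convention for zero-sum games with alternating turns, where the reward is negated at each level so that every node stores values from one consistent player's perspective.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.N = 0      # visit count
        self.Q = 0.0    # cumulative reward

    def value(self):
        """Mean reward Q/N (0.0 for an unvisited node)."""
        return self.Q / self.N if self.N else 0.0

def backpropagate(leaf, reward, two_player=False):
    """Increment N and Q on every node from `leaf` up to the root."""
    node = leaf
    while node is not None:
        node.N += 1
        node.Q += reward
        if two_player:
            reward = -reward  # assumed zero-sum convention: flip per level
        node = node.parent

root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
backpropagate(leaf, 1.0)
backpropagate(leaf, 0.0)
```

After these two calls every node on the path has N = 2 and Q = 1.0, so `value()` returns 0.5 for each.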
Rapid Action Value Estimation (RAVE)
Rapid Action Value Estimation (RAVE) is an enhancement that accelerates value convergence, especially in games like Go. It shares simulation statistics across the entire tree, not just a single path.
- All-Moves-As-First (AMAF): RAVE tracks, for a given node, the average result of all simulations in its subtree where a specific action was taken at any point, regardless of when.
- Blended Value: A node's final value estimate is a weighted blend of its standard MCTS value (Q/N) and its RAVE value, with the weight shifting to the standard value as visits increase.
- Effect: Drastically reduces the number of simulations needed to identify good moves, making the rollout phase more informative early in search.
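The blend can be sketched with one widely cited weighting schedule, β = sqrt(k / (3N + k)), where k is a tunable equivalence parameter. The schedule, the default k = 1000, and the function names are assumptions here, since the text above does not specify how the weight is computed.

```python
import math

def rave_beta(n, k=1000):
    """Weight on the RAVE (AMAF) value: 1.0 at n = 0, decaying toward 0."""
    return math.sqrt(k / (3 * n + k))

def blended_value(q, n, q_amaf, n_amaf, k=1000):
    """(1 - beta) * Q/N + beta * Q_amaf/N_amaf.

    q, n: standard MCTS reward sum and visit count for the node.
    q_amaf, n_amaf: reward sum and count over simulations in which the
    action was played at any point (all-moves-as-first).
    """
    mc = q / n if n > 0 else 0.0
    if n_amaf == 0:
        return mc  # no AMAF data yet, so use the plain estimate
    beta = rave_beta(n, k)
    return (1 - beta) * mc + beta * (q_amaf / n_amaf)
```

With k = 1000, the two estimates carry equal weight at N = 1000 visits. Below that the AMAF statistics dominate, and above it the standard Q/N value takes over.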
Progressive Widening
Progressive Widening is a technique to handle large or continuous action spaces where enumerating all child nodes during Expansion is intractable.
- Core Mechanism: The number of child actions considered for a node is not fixed. It grows slowly as a function of that node's visit count (e.g., k * N(node)^α).
- Two-Stage Process: 1) Decide whether to expand a new action (based on visit count). 2) Decide which new action to add (often via a heuristic or random sampling).
- Relation to Rollout: It ensures the tree expands in a tractable manner, focusing rollout resources on a gradually expanding set of promising actions rather than an impossibly large set.
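The first stage of that process reduces to a single comparison. The sketch below assumes the strict form "expand while the child count is below k * N^α" with defaults k = 1 and α = 0.5. Implementations differ on the exact inequality and rounding, and the function names are hypothetical.

```python
def may_add_child(num_children, visits, k=1.0, alpha=0.5):
    """Stage 1: allow a new child only while the count is below k * N^alpha."""
    return num_children < k * visits ** alpha

def children_after(total_visits, k=1.0, alpha=0.5):
    """Simulate how the child count grows if one child is added per allowed visit."""
    children = 0
    for visits in range(1, total_visits + 1):
        if may_add_child(children, visits, k, alpha):
            children += 1
    return children
```

With these defaults the child count tracks roughly the square root of the visit count, so a node needs on the order of 100 visits before it holds about 10 children.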
Virtual Loss
Virtual Loss is a parallelization technique for MCTS that enables multiple threads to efficiently share a single search tree without excessive contention.
- Mechanism: When a thread selects a node during the Selection phase, it temporarily adds a 'virtual loss' (e.g., -1) to that node's reward sum Q. This artificially lowers its UCT score, making other threads less likely to select the same path immediately.
- Reset: After the thread completes its simulation (rollout) and backpropagation, it removes the virtual loss and adds the true result.
- Benefit: Encourages parallel threads to explore different parts of the tree, increasing the diversity of rollouts and overall search efficiency.
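The apply-then-reset cycle can be sketched with per-node locks. Everything here is an assumption for illustration: the `Node` fields, the loss size of 1.0, and the choice to count the visit at selection time (a common variant that also lowers the exploration bonus for in-flight paths).

```python
import threading

VIRTUAL_LOSS = 1.0  # assumed size of the temporary penalty

class Node:
    def __init__(self):
        self.N = 0
        self.Q = 0.0
        self.lock = threading.Lock()

def apply_virtual_loss(path):
    """Called during Selection: make the chosen path look worse to other threads."""
    for node in path:
        with node.lock:
            node.N += 1              # count the visit up front
            node.Q -= VIRTUAL_LOSS   # temporary penalty

def revert_and_update(path, reward):
    """Called after the rollout: remove the penalty and record the true result."""
    for node in path:
        with node.lock:
            node.Q += VIRTUAL_LOSS + reward  # the visit was already counted

path = [Node(), Node()]
apply_virtual_loss(path)
in_flight = [(n.N, n.Q) for n in path]   # statistics other threads would see
revert_and_update(path, reward=1.0)
settled = [(n.N, n.Q) for n in path]
```

While the rollout is in flight, each node on the path shows one visit and a reward sum of -1.0. Once the thread finishes, the same nodes show one visit and the true reward of 1.0.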

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.