Inferensys

Simulation (Rollout)

Simulation, or rollout, is the Monte Carlo Tree Search (MCTS) phase where a fast playout policy is used to play from a node to a terminal state, generating a reward estimate.
MCTS PHASE

What is Simulation (Rollout)?

The simulation, or rollout, is a core phase of the Monte Carlo Tree Search (MCTS) algorithm responsible for estimating the value of a newly expanded game state.

Simulation (rollout) is the phase in Monte Carlo Tree Search where, starting from a newly expanded node, a fast playout policy (often random or heuristic) is used to play the game or model the process until a terminal state is reached, generating a final outcome or reward. This computationally cheap playout provides a stochastic estimate of the long-term value of the node without a full, expensive search.

The result of this simulated trajectory is then backpropagated up the tree to update the statistics of all ancestor nodes. While simple random playouts are common, more informed policies can drastically improve sample efficiency. This phase is distinct from the selection and expansion phases, focusing purely on outcome estimation rather than tree traversal or growth.
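As a minimal sketch of a single rollout, the example below uses a hypothetical toy game that is not part of the original text: one-pile Nim, where players alternate removing 1 to 3 stones and whoever takes the last stone wins. The names `legal_actions` and `rollout` are illustrative assumptions, and the reward convention (+1 if player 0 wins, -1 otherwise) is one common choice among several.

```python
import random

# Hypothetical toy game (an assumption for illustration): one-pile Nim.
# Players alternate removing 1-3 stones; whoever takes the last stone wins.

def legal_actions(stones):
    """Return the moves available with `stones` left in the pile."""
    return [n for n in (1, 2, 3) if n <= stones]

def rollout(stones, player, rng):
    """Play uniformly random moves from (stones, player) until the game ends.

    Returns +1 if player 0 wins the playout and -1 if player 1 wins.
    """
    while stones > 0:
        stones -= rng.choice(legal_actions(stones))
        if stones == 0:
            return 1 if player == 0 else -1
        player = 1 - player
    raise ValueError("rollout must start from a non-terminal state")

# Averaging many cheap playouts gives a stochastic value estimate for the state.
rng = random.Random(0)
results = [rollout(10, 0, rng) for _ in range(1000)]
estimate = sum(results) / len(results)
```

Each call is cheap because it only samples moves and never touches the search tree; the averaged `estimate` is what a tree node would accumulate across iterations.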

MONTE CARLO TREE SEARCH

Key Characteristics of the Simulation Phase

The simulation (or rollout) phase is where Monte Carlo Tree Search estimates the value of a newly expanded node by playing out the scenario to a terminal state using a fast, often random, policy.

01

Objective: Value Estimation

The core purpose is to generate a sample outcome (win/loss, reward) from the expanded node's state. This stochastic estimate, averaged over many rollouts, approximates the true expected value of that position. It provides the data point for the subsequent backpropagation phase.

  • Key Metric: The Monte Carlo estimate is an unbiased estimate of the node's value under the playout policy (not under optimal play), but it can have high variance, especially with random playouts.
  • Analogy: Like running a fast, approximate simulation of a business strategy to gauge its potential success, rather than a full, costly analysis.
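
The averaging step can be written out explicitly. The notation below is an assumption introduced for illustration: z_i is the outcome of the i-th rollout from state s, N is the number of rollouts, and the rollouts are taken to be independent samples under a fixed playout policy.

```latex
\hat{V}_N(s) = \frac{1}{N} \sum_{i=1}^{N} z_i,
\qquad
\mathbb{E}\big[\hat{V}_N(s)\big] = V^{\pi_{\text{roll}}}(s),
\qquad
\operatorname{Var}\big[\hat{V}_N(s)\big] = \frac{\sigma^2}{N}

% For win/loss rewards z_i \in \{+1, -1\} with win probability p:
\sigma^2 = 1 - (2p - 1)^2 = 4p(1 - p) \le 1
\quad\Longrightarrow\quad
\text{standard error} = \frac{\sigma}{\sqrt{N}} \le \frac{1}{\sqrt{N}}
```

Under these assumptions, 100 win/loss rollouts pin the estimate down to a standard error of at most 0.1, and halving that error takes four times as many rollouts.
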
02

The Playout Policy

This is the algorithm that selects actions during the rollout. Its speed is critical, as thousands of simulations are run.

  • Default Policy: Often a uniform random selection from legal actions. It's fast and requires no domain knowledge.
  • Heuristic Policy: A simple, rule-based strategy (e.g., capture if possible in chess) that yields more informative outcomes than pure random, reducing variance.
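  • Learned Policy: In advanced systems such as the original AlphaGo, a lightweight learned policy, much cheaper to evaluate than the full policy network, guides rollouts and markedly improves sample efficiency. AlphaZero goes a step further and drops rollouts entirely, evaluating leaf nodes with a neural network value estimate instead.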
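
One way to keep the playout policy swappable is to pass it into the rollout as a plain function. The sketch below assumes a hypothetical one-pile Nim game (remove 1 to 3 stones, taking the last stone wins); `random_policy`, `heuristic_policy`, and `rollout` are illustrative names, and the heuristic is a single hand-written rule rather than anything from a real engine.

```python
import random

def legal_actions(stones):
    """Moves available in the hypothetical Nim game: take 1, 2, or 3 stones."""
    return [n for n in (1, 2, 3) if n <= stones]

def random_policy(stones, rng):
    """Default policy: uniform random choice among legal moves."""
    return rng.choice(legal_actions(stones))

def heuristic_policy(stones, rng):
    """Rule-based policy: take the whole pile if that wins now, else play randomly."""
    if stones <= 3:
        return stones
    return rng.choice(legal_actions(stones))

def rollout(stones, player, policy, rng):
    """Play `policy` for both sides until the pile is empty.

    Returns +1 if player 0 takes the last stone, -1 if player 1 does.
    """
    if stones <= 0:
        raise ValueError("rollout must start from a non-terminal state")
    while True:
        stones -= policy(stones, rng)
        if stones == 0:
            return 1 if player == 0 else -1
        player = 1 - player

rng = random.Random(0)
random_result = rollout(10, 0, random_policy, rng)
heuristic_result = rollout(10, 0, heuristic_policy, rng)
```

Because both policies share one signature, the rest of the search does not need to know which one is in use; only the cost and quality of each playout changes.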
03

Terminal State & Reward

The simulation continues until a terminal state is reached, where the game or process concludes. The outcome is then scored.

  • Deterministic Terminal: A clear win, loss, or draw condition (e.g., checkmate in chess).
  • Stochastic Terminal: An outcome with a probabilistic score (e.g., final profit in a simulated supply chain).
  • Reward Signal: The result is converted into a numerical reward (e.g., +1 for win, 0 for draw, -1 for loss). This scalar is what is propagated back through the tree.
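
A sketch of terminal detection and reward conversion, using tic-tac-toe as an assumed example game. The board encoding (a 9-character string of 'X', 'O', and '.') and the function names `terminal_result` and `reward` are hypothetical choices made for this illustration.

```python
# Index triples that form a winning line on a 3x3 board stored row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def terminal_result(board):
    """Return 'X' or 'O' for a win, 'draw' for a full board, None if play continues."""
    for a, b, c in LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return 'draw' if '.' not in board else None

def reward(result, player):
    """Convert a terminal result into a scalar from `player`'s perspective."""
    if result is None:
        raise ValueError("state is not terminal")
    if result == 'draw':
        return 0
    return 1 if result == player else -1
```

The scalar returned by `reward` is the value that would be handed to backpropagation; the perspective argument matters in two-player games, where the same outcome is +1 for one side and -1 for the other.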
04

Computational Trade-Off: Speed vs. Fidelity

The phase embodies a key engineering trade-off. Faster, simpler policies allow more simulations within a fixed compute budget, so the averaged estimate benefits more from the law of large numbers. However, each individual playout may be a noisier or less representative sample.

  • High Variance: Random playouts in complex games can produce misleading outcomes, requiring many more iterations for a reliable average.
  • Bias-Variance Trade-off: Introducing a smart heuristic reduces variance but can introduce bias if the heuristic is suboptimal. MCTS does not prune outright, but the search may starve truly good lines of visits when the heuristic misjudges them.
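
The variance side of this trade-off can be put into rough numbers. The sketch below assumes independent rollouts, so the standard error of the average is sigma / sqrt(N), and it uses made-up per-rollout costs and outcome spreads purely for illustration. It says nothing about bias, which has to be judged separately.

```python
import math

def standard_error(sigma, budget, cost_per_rollout):
    """Standard error of the averaged estimate under a fixed compute budget.

    sigma: standard deviation of a single rollout outcome (assumed known).
    budget, cost_per_rollout: arbitrary but matching cost units.
    """
    n = budget // cost_per_rollout  # rollouts that fit in the budget
    if n <= 0:
        raise ValueError("budget too small for a single rollout")
    return sigma / math.sqrt(n)

# Hypothetical numbers: the heuristic policy is 4x slower per rollout
# but produces less noisy outcomes.
random_se = standard_error(sigma=1.0, budget=10_000, cost_per_rollout=1)
heuristic_se = standard_error(sigma=0.6, budget=10_000, cost_per_rollout=4)
```

With these invented figures the cheap random policy ends up with the slightly smaller standard error (0.010 against 0.012). If the heuristic cost only 2 units per rollout, the comparison would flip in its favor, which is why the choice is usually settled by measurement rather than by rule of thumb.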
05

Parallelism & Virtual Loss

Individual rollouts are embarrassingly parallel: multiple rollouts from different leaf nodes can run simultaneously without sharing state. The harder part is coordinating threads on a shared tree, and the virtual loss technique is commonly used for that.

  • Mechanism: When a thread selects a node for a rollout, it temporarily adds a virtual loss to that node's statistics. This artificially lowers its estimated value, discouraging other threads from selecting the same path immediately.
  • Benefit: It reduces thread contention, effectively diversifying exploration across the tree during parallel execution.
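
A minimal sketch of the bookkeeping behind virtual loss, assuming each node stores a visit count and a running value sum guarded by a lock. The `Node` class and its method names are hypothetical; real implementations vary in how large a loss they apply and whether they use locks or atomic counters.

```python
import threading

class Node:
    """Tree node statistics shared between search threads (illustrative only)."""

    def __init__(self):
        self.visits = 0
        self.value_sum = 0.0
        self.lock = threading.Lock()

    def mean_value(self):
        with self.lock:
            return self.value_sum / self.visits if self.visits else 0.0

    def apply_virtual_loss(self, loss=1.0):
        """Called during selection: count the visit now and assume it was lost."""
        with self.lock:
            self.visits += 1
            self.value_sum -= loss

    def backpropagate(self, reward, loss=1.0):
        """Called after the rollout: undo the virtual loss and add the real reward.

        The visit was already counted by apply_virtual_loss.
        """
        with self.lock:
            self.value_sum += loss + reward

node = Node()
node.visits, node.value_sum = 4, 2.0   # mean value 0.5 before any thread arrives
node.apply_virtual_loss()              # mean drops while a rollout is in flight
in_flight_mean = node.mean_value()
node.backpropagate(reward=1.0)         # real result replaces the virtual loss
final_mean = node.mean_value()
```

While the rollout is in flight the node looks worse than it is, so other threads running selection tend to pick a different path; once the real reward arrives, the statistics are exactly what a sequential search would have recorded.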
06

Enhancements: RAVE & Domain Knowledge

Basic random rollouts can be improved with techniques that share information across the tree.

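  • Rapid Action Value Estimation (RAVE): Uses the all-moves-as-first (AMAF) heuristic: a node's statistics for an action are updated whenever that action is played at any later point in a simulation passing through the node, not only when it is the move chosen at that node. This accelerates the convergence of value estimates early in the search, especially in Go.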
  • Domain-Specific Termination: In non-game applications (e.g., logistics planning), a simulation may terminate after a fixed horizon or when a key milestone is reached, with a heuristic evaluation of the intermediate state.
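
Fixed-horizon termination can be sketched as a rollout that stops after a set number of steps and falls back to a heuristic score. Everything in the example is an assumption made for illustration: the callables passed to `truncated_rollout`, and the toy process (a random walk that ends at +5 or -5) standing in for a real planning model.

```python
import random

def truncated_rollout(state, step, is_terminal, terminal_reward,
                      heuristic_value, horizon, rng):
    """Roll out for at most `horizon` steps.

    Returns the true reward if a terminal state is reached in time,
    otherwise a heuristic evaluation of the state where the rollout stopped.
    """
    for _ in range(horizon):
        if is_terminal(state):
            return terminal_reward(state)
        state = step(state, rng)
    return terminal_reward(state) if is_terminal(state) else heuristic_value(state)

# Toy process: a random walk on the integers that ends at +5 or -5.
def step(state, rng):
    return state + rng.choice((-1, 1))

def is_terminal(state):
    return abs(state) >= 5

def terminal_reward(state):
    return 1.0 if state > 0 else -1.0

def heuristic_value(state):
    return state / 5  # crude guess: closer to +5 looks better

value = truncated_rollout(0, step, is_terminal, terminal_reward,
                          heuristic_value, horizon=20, rng=random.Random(0))
```

The horizon caps the cost of each simulation, at the price of trusting `heuristic_value` for whatever lies beyond it.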
MCTS PHASE

How the Simulation (Rollout) Phase Works

The simulation phase, also called a rollout, is the third step in a Monte Carlo Tree Search (MCTS) iteration where a fast policy is used to play out the scenario from a newly expanded node to a terminal state.

In the simulation (rollout) phase, the algorithm uses a playout policy, typically something fast and lightweight such as uniform random selection, to sample a possible future from the newly expanded node until reaching a terminal state (e.g., game end). This default policy generates a final outcome, such as a win/loss or a scalar reward, without the computational cost of further tree expansion. The result provides a stochastic estimate of the node's value.

The rollout is a Monte Carlo sampling technique that estimates the long-term value of a state through repeated random forward play. While simple, this approach allows MCTS to evaluate positions in vast state spaces where exhaustive search is impossible. The efficiency of the rollout policy is critical, as it is executed thousands of times per search. More informed policies can reduce variance and improve convergence speed.
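
To show where the rollout sits among the other phases, here is a compact sketch of a full MCTS loop on a hypothetical one-pile Nim game (remove 1 to 3 stones, taking the last stone wins). The class and function names, the UCT exploration constant of 1.4, and the win-counting convention are all assumptions chosen for this example, not a reference implementation.

```python
import math
import random

class Node:
    def __init__(self, stones, player, parent=None):
        self.stones = stones      # stones left in the pile
        self.player = player      # player to move at this node (0 or 1)
        self.parent = parent
        self.children = {}        # action -> child Node
        self.untried = [n for n in (1, 2, 3) if n <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who moved INTO this node

def select(node, c=1.4):
    """Phase 1: descend with UCT while the node is fully expanded."""
    while not node.untried and node.children:
        parent_visits = node.visits
        node = max(node.children.values(),
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(parent_visits) / ch.visits))
    return node

def expand(node, rng):
    """Phase 2: add one untried child (a terminal node is returned unchanged)."""
    if not node.untried:
        return node
    action = node.untried.pop(rng.randrange(len(node.untried)))
    child = Node(node.stones - action, 1 - node.player, parent=node)
    node.children[action] = child
    return child

def simulate(node, rng):
    """Phase 3: uniform random playout; returns the winning player (0 or 1)."""
    stones, player = node.stones, node.player
    if stones == 0:
        return 1 - player         # the previous mover took the last stone
    while True:
        stones -= rng.choice([n for n in (1, 2, 3) if n <= stones])
        if stones == 0:
            return player
        player = 1 - player

def backpropagate(node, winner):
    """Phase 4: update statistics on the path back to the root."""
    while node is not None:
        node.visits += 1
        if winner != node.player:  # the mover into this node is 1 - node.player
            node.wins += 1
        node = node.parent

def search(stones, iterations, seed=0):
    rng = random.Random(seed)
    root = Node(stones, player=0)
    for _ in range(iterations):
        leaf = expand(select(root), rng)
        backpropagate(leaf, simulate(leaf, rng))
    # Recommend the most visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)

best_action = search(stones=10, iterations=2000)
```

Note that `simulate` never creates nodes or reads tree statistics: it only samples moves and reports a winner, which is what keeps the third phase cheap enough to run thousands of times per search.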

SIMULATION (ROLLOUT)

Frequently Asked Questions

A simulation, also called a rollout or playout, is the core estimation phase within the Monte Carlo Tree Search (MCTS) algorithm. This section answers key technical questions about its function, design, and optimization.

A simulation (rollout) is the phase in the Monte Carlo Tree Search (MCTS) algorithm where, starting from a newly expanded leaf node, a fast playout policy is used to sample a sequence of actions until a terminal state (e.g., game end) is reached, generating a final outcome or reward used to estimate the node's value.

This process is a form of Monte Carlo estimation, using random sampling to approximate the expected utility of a state when an exact calculation is computationally infeasible. The result—often a win/loss/draw signal or a numerical score—is then backpropagated up the tree to update the statistics of all ancestor nodes, refining the algorithm's understanding of which paths are promising.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.