Dirichlet noise is a random perturbation sampled from a Dirichlet distribution and added to the prior probabilities at the root node of a Monte Carlo Tree Search. This injection of stochasticity, as used in systems like AlphaZero, encourages the search algorithm to explore a broader range of initial actions during the early phases of self-play games, preventing premature convergence on a narrow set of moves. It directly addresses the exploration-exploitation tradeoff by ensuring the root's children are visited even if their initial neural network priors are low.
Dirichlet Noise

What is Dirichlet Noise?
A technique for enhancing exploration in neural Monte Carlo Tree Search (MCTS) during self-play training.
The noise is parameterized by a concentration parameter (often denoted α) and a mixing coefficient (ε). A typical implementation mixes a small fraction of Dirichlet noise into the normalized action priors that the policy network produces for the root node. This technique helps the agent discover novel strategies in environments with sparse rewards, because it provides a structured form of random exploration whose influence on each move fades as that move's visit count increases and its value estimate becomes more reliable through repeated simulations.
Key Characteristics of Dirichlet Noise
Dirichlet noise is a stochastic perturbation applied to the prior action probabilities at the root node of a Monte Carlo Tree Search (MCTS) to ensure sufficient exploration during self-play training, particularly in algorithms like AlphaZero.
Mathematical Foundation
Dirichlet noise is sampled from the Dirichlet distribution, a multivariate generalization of the Beta distribution. It is defined by a concentration parameter vector (α). In the context of MCTS, a symmetric Dirichlet distribution is typically used, where α is a single scalar shared by all actions; AlphaZero used values ranging from 0.03 for Go to 0.3 for chess. Every sampled noise vector is non-negative and sums to 1, making it a natural choice for perturbing a probability distribution over actions.
- Parameter α: Controls the concentration of the distribution. A smaller α (e.g., 0.03) results in a sparse, peaky noise vector, favoring a few actions strongly. A larger α results in a more uniform noise vector.
- Noise Injection: The sampled Dirichlet noise vector is mixed with the neural network's prior policy output (P) at the root node: P' = (1 - ε) * P + ε * η, where η ~ Dir(α) and ε is a small mixing parameter (e.g., 0.25).
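The mixing rule above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names `sample_dirichlet` and `mix_with_noise` are hypothetical, and the Dirichlet sample is built by normalizing independent Gamma draws, which is one standard way to sample from this distribution.

```python
import random


def sample_dirichlet(alpha, n, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over n outcomes."""
    while True:
        # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
        total = sum(draws)
        if total > 0.0:  # retry in the unlikely case every draw underflowed to 0
            return [d / total for d in draws]


def mix_with_noise(priors, alpha=0.3, epsilon=0.25, rng=None):
    """Return P' = (1 - epsilon) * P + epsilon * eta, with eta ~ Dir(alpha)."""
    rng = rng or random.Random()
    eta = sample_dirichlet(alpha, len(priors), rng)
    return [(1.0 - epsilon) * p + epsilon * e for p, e in zip(priors, eta)]


# Example priors are made up for illustration.
priors = [0.90, 0.05, 0.03, 0.02]
noisy = mix_with_noise(priors, alpha=0.3, epsilon=0.25, rng=random.Random(0))
print(noisy)
```

Because both P and η sum to 1, the mixture P' also sums to 1 without renormalization, and each action keeps at least (1 - ε) of its original prior.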
Primary Role in Exploration
The core function of Dirichlet noise is to encourage exploration in the early moves of self-play games. Without it, the MCTS guided by a neural network's policy can rapidly converge to a narrow set of high-probability moves, leading to mode collapse where the agent only explores a limited subset of the state space.
- Counteracts Early Convergence: By adding stochasticity to the root node's priors, the search is forced to consider a wider variety of initial actions, even those deemed less likely by the current policy network.
- Discovering Novel Strategies: This forced exploration is crucial for discovering new, potentially superior strategies that the current policy model undervalues, driving the iterative improvement cycle in self-play reinforcement learning.
Application in AlphaZero
Dirichlet noise at the root was introduced in AlphaGo Zero and carried over into the AlphaZero algorithm. It is applied only at the root node of the MCTS during the self-play data generation phase.
- Root-Node Specific: Noise is not added during inference (competitive play) or to child nodes during search. This ensures exploratory data generation while retaining deterministic, exploitative play for evaluation.
- Combination with PUCT: The noise-perturbed prior probabilities (P') are used in the Polynomial Upper Confidence bound for Trees (PUCT) formula during the selection phase: U(s, a) = c_puct * P'(s, a) * (√N(s) / (1 + N(s, a))). This directly biases exploration towards the noisy prior.
- Empirical Values: In AlphaZero's Go play, typical parameters were α = 0.03 (for the 19x19 board) and ε = 0.25.
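To make the PUCT interaction concrete, the sketch below scores root actions with the exploration term given above and picks the action that maximizes Q + U. The helper names, the value `c_puct = 1.5`, and the example priors and visit counts are assumptions chosen for illustration; they are not taken from any published configuration.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Q(s, a) + U(s, a), with U = c_puct * P * sqrt(N(s)) / (1 + N(s, a))."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u


def select_action(q_values, priors, visit_counts, c_puct=1.5):
    """Return the index of the root action with the highest PUCT score."""
    parent_visits = sum(visit_counts)
    scores = [
        puct_score(q, p, parent_visits, n, c_puct)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


q_values = [0.0, 0.0, 0.0]          # no value information yet
visit_counts = [10, 0, 0]           # the favoured move has absorbed every visit
clean_priors = [0.90, 0.05, 0.05]   # raw network priors P
noisy_priors = [0.70, 0.05, 0.25]   # one possible P' after noise is mixed in

print(select_action(q_values, clean_priors, visit_counts))  # 0
print(select_action(q_values, noisy_priors, visit_counts))  # 2
```

With the clean priors the search keeps returning to the first action; once the noise lifts the third action's prior, its exploration term is large enough to win the next selection.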
Distinction from Other Noise Types
Dirichlet noise is uniquely suited for this task compared to other common noise types:
- vs. Gaussian Noise: Gaussian noise can produce negative values or cause the probability vector to not sum to 1, requiring clipping and renormalization. A Dirichlet sample is already a valid point on the probability simplex, so its convex mixture with the prior remains a valid distribution.
- vs. Epsilon-Greedy: Epsilon-greedy makes a completely random choice with probability ε. Dirichlet noise provides a structured, yet stochastic, bias that still respects the relative strengths suggested by the policy network, just with added variance.
- vs. Temperature Scaling: Temperature scaling adjusts the entropy of the final action distribution after the search is complete. Dirichlet noise affects the search guidance itself by altering the priors at the start of the tree traversal.
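The contrast with temperature scaling is easier to see in code. The sketch below shows the post-search step only: visit counts are turned into a move distribution proportional to N(s, a)^(1/τ), a form commonly used in AlphaZero-style systems. The function name `visit_counts_to_policy` and the example counts are hypothetical.

```python
def visit_counts_to_policy(visit_counts, temperature=1.0):
    """Convert root visit counts into a move distribution after search ends.

    Each probability is proportional to N(s, a) ** (1 / temperature).
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]


counts = [60, 30, 10]  # made-up visit counts at the root
proportional = visit_counts_to_policy(counts, temperature=1.0)
sharper = visit_counts_to_policy(counts, temperature=0.5)
print(proportional)
print(sharper)
```

Nothing in this step touches the priors: the temperature only reshapes the distribution that the finished search produced, whereas Dirichlet noise changes which moves the search visits in the first place.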
Hyperparameter Sensitivity
The effectiveness of Dirichlet noise is sensitive to its two main hyperparameters: the concentration parameter (α) and the mixing coefficient (ε).
- Concentration Parameter (α): Must be scaled appropriately for the action space size. For a game with n legal moves at the root, a common heuristic is to scale α in inverse proportion to n (for example, α = 1/n) or to use a small constant. Too high an α makes the noise too uniform, diluting the policy's guidance. Too low makes it too sparse, causing excessive focus on a few random actions.
- Mixing Coefficient (ε): Balances the trust between the learned policy and the exploration noise. A value around 0.1 to 0.3 is typical. Some training schedules anneal it over time.
- Interaction with Visit Count: The influence of the noisy prior diminishes as a root node action is visited more, because the PUCT exploration term, which is divided by 1 + N(s, a), shrinks relative to the empirically derived value estimate.
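The visit-count interaction can be checked with a small calculation. The sketch below compares the exploration bonus of one action under a clean prior and under a noise-boosted prior as that action accumulates visits. The priors, `c_puct = 1.5`, and the fixed parent visit count of 400 are illustrative assumptions; in a real search the parent count grows as well.

```python
import math


def exploration_bonus(prior, parent_visits, child_visits, c_puct=1.5):
    """U(s, a) = c_puct * P * sqrt(N(s)) / (1 + N(s, a))."""
    return c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


clean_prior = 0.02   # what the network assigned to the move
noisy_prior = 0.20   # the same move after a favourable noise draw
parent_visits = 400  # held fixed to isolate the effect of the child's visits

gaps = []
for child_visits in (0, 1, 10, 100):
    gap = (exploration_bonus(noisy_prior, parent_visits, child_visits)
           - exploration_bonus(clean_prior, parent_visits, child_visits))
    gaps.append(gap)
    print(f"N(s, a) = {child_visits:3d}  extra bonus from noise = {gap:.4f}")
```

The extra bonus contributed by the noise falls in proportion to 1 / (1 + N(s, a)), so after enough visits the move's ranking depends mostly on its value estimate rather than on the perturbed prior.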
Related Concepts in Search
Dirichlet noise interacts with several other MCTS and reinforcement learning mechanisms:
- Exploration-Exploitation Tradeoff: It is a direct, tunable mechanism for controlling this tradeoff specifically at the root of the search tree.
- Progressive Widening: In continuous or vast action spaces, Dirichlet noise at the root complements techniques like progressive widening, which gradually expands the number of child nodes considered.
- Self-Play Training Loop: The noise is a key component ensuring diversity in the training data generated by self-play, preventing the agent from becoming stuck in a local optimum by playing against a slightly varied version of itself.
- Policy Improvement Theorem: The noise supports the exploration that the policy iteration cycle needs in order to keep improving, since every legal root action retains a non-zero chance of being evaluated.
How Dirichlet Noise Works in Neural MCTS
Dirichlet noise is a technique for injecting structured randomness into the root node of a search tree to prevent premature convergence during self-play training.
Dirichlet noise is a form of random perturbation added to the prior action probabilities output by a neural network at the root node of a Monte Carlo Tree Search (MCTS). In architectures like AlphaZero, this noise encourages exploration in the early moves of self-play games by ensuring a non-zero probability of sampling all legal actions, preventing the search from prematurely converging on a narrow, suboptimal policy. The noise is sampled from a Dirichlet distribution, typically with a concentration parameter favoring a sparse, exploratory prior.
The technique directly addresses the exploration-exploitation tradeoff in the selection phase of MCTS. By blending the network's prior policy with Dirichlet noise, the PUCT selection formula (a prior-weighted variant of Upper Confidence Bound for Trees) is biased to try novel actions early, fostering diverse game trajectories. This diversity is critical for generating robust training data, allowing the policy network and value network to learn a more general and strategic understanding of the environment rather than a brittle, self-reinforcing line of play.
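The flow described above can be sketched as a single root-preparation step that runs once before the simulations start. The function name `prepare_root_priors`, the dictionary-based policy format, and the legal-move masking are assumptions made for illustration; real systems organize this step in their own way.

```python
import random


def prepare_root_priors(policy, legal_actions, self_play,
                        alpha=0.3, epsilon=0.25, rng=None):
    """Build root priors: mask to legal actions, renormalize, add noise in self-play.

    Assumes the policy assigns positive total probability to the legal actions.
    """
    rng = rng or random.Random()
    masked = {a: policy[a] for a in legal_actions}
    total = sum(masked.values())
    priors = {a: p / total for a, p in masked.items()}
    if not self_play:
        return priors  # evaluation or deployment: priors stay untouched
    # Symmetric Dirichlet sample over the legal actions via normalized Gamma draws.
    draws = {a: rng.gammavariate(alpha, 1.0) for a in legal_actions}
    noise_total = sum(draws.values())
    if noise_total == 0.0:
        return priors  # degenerate draw; skip noise rather than divide by zero
    return {
        a: (1.0 - epsilon) * priors[a] + epsilon * draws[a] / noise_total
        for a in legal_actions
    }


policy = {"a": 0.7, "b": 0.2, "c": 0.1}  # made-up network output
legal = ["a", "b"]                        # "c" is illegal in this position

eval_priors = prepare_root_priors(policy, legal, self_play=False)
train_priors = prepare_root_priors(policy, legal, self_play=True,
                                   rng=random.Random(0))
print(eval_priors)
print(train_priors)
```

Only the root of a self-play search passes through the noisy branch; nodes deeper in the tree and all evaluation games would use the unperturbed priors.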
Frequently Asked Questions
Dirichlet noise is a specialized technique used to enhance exploration in neural Monte Carlo Tree Search algorithms. Below are answers to common technical questions about its purpose, mechanics, and implementation.
Dirichlet noise is a form of random perturbation sampled from a Dirichlet distribution and added to the prior action probabilities at the root node of a Monte Carlo Tree Search (MCTS) tree. It works by injecting structured randomness into the priors that guide the selection phase, encouraging the exploration of a wider variety of initial moves, especially in the early stages of self-play or when facing novel positions. The noise vector (η) is weighted by a mixing coefficient (ε) and interpolated with the original prior probabilities from a neural network's policy head, ensuring that while exploration is boosted, the search is still guided by the network's learned knowledge. This mechanism directly combats search bias and prevents the algorithm from prematurely converging on a narrow set of high-prior actions.
Related Terms
Dirichlet noise is a key component in modern neural MCTS systems. Understanding its related concepts is essential for engineers implementing robust exploration strategies in planning and game-playing agents.
Exploration-Exploitation Tradeoff
The fundamental dilemma in sequential decision-making of whether to sample new, uncertain actions (exploration) or to favor actions known to yield high rewards (exploitation). In MCTS, this is mathematically managed by selection policies like Upper Confidence Bound for Trees (UCT). Dirichlet noise directly injects stochasticity to bias this tradeoff towards exploration, especially in the early, uncertain phases of self-play training.
Neural Monte Carlo Tree Search
A hybrid architecture that integrates deep neural networks with classical MCTS. It typically uses a policy network (or policy head) to suggest promising actions during tree expansion and a value network (or value head) to evaluate states, replacing random rollouts. Dirichlet noise is applied to the prior probabilities output by the policy network at the root node, ensuring the neural guidance does not prematurely collapse exploration.
AlphaZero Algorithm
A seminal self-play reinforcement learning system that combines a deep residual neural network with Monte Carlo Tree Search. It achieves superhuman performance in games like chess, shogi, and Go. Dirichlet noise, inherited from AlphaGo Zero, is added to the root node's prior probabilities during training to ensure sufficient exploration of opening moves, preventing the agent from overfitting to a narrow set of early-game strategies.
Prior Probability
In neural MCTS, the prior probability (P(s, a)) is an estimate, provided by a trained policy network, of the likelihood that action a is optimal from state s. This prior guides the initial search. Dirichlet noise perturbs these priors at the root node: P'(s, a) = (1 - ε) * P(s, a) + ε * η, where η is sampled from a Dirichlet distribution and ε is a small mixing parameter. This ensures the search tests a wider range of initial actions.
Progressive Widening
A technique used in MCTS for environments with large or continuous action spaces. Instead of expanding all possible child nodes at once, the algorithm gradually increases the number of considered actions for a parent node as its visit count grows. Dirichlet noise addresses a related but distinct problem: even with a finite, discrete action set (like the points of Go's 19x19 board), a strong policy network can assign near-zero prior to valid but novel moves, which progressive widening alone cannot fix.
Self-Play Training
A reinforcement learning paradigm where an agent improves by playing games against iterations of itself. This is the core training loop of AlphaZero and MuZero. Dirichlet noise is exclusively applied during these self-play games to generate diverse, exploratory training data. It is typically not used during competitive evaluation or real-world deployment, where the agent should exploit its learned knowledge deterministically.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.