AlphaZero is a model-based reinforcement learning system that combines a deep neural network with a Monte Carlo Tree Search planner. The neural network, trained through self-play, provides both a policy (probability distribution over moves) and a value estimation (predicted game outcome) to guide the tree search. This tight integration allows it to evaluate millions of positions efficiently, focusing its search on the most promising branches identified by its learned model.
AlphaZero

What is AlphaZero?
AlphaZero is a groundbreaking reinforcement learning algorithm that achieved superhuman performance in chess, shogi, and Go through self-play, without using human game data or domain-specific knowledge beyond the basic rules.
The algorithm's core innovation is its ability to learn tabula rasa, mastering complex games through self-play and discovering novel strategies. It operates by playing games against itself, using the results to iteratively update its neural network parameters via reinforcement learning. The search and the network divide the work of exploration and exploitation: MCTS explores possible continuations, while the network's learned priors and value estimates concentrate that search on promising lines. Together they make long-horizon planning tractable in games with a massive state space and a high branching factor.
Key Technical Features
AlphaZero is a reinforcement learning algorithm that masters complex games through self-play, combining a deep neural network with Monte Carlo Tree Search for superhuman performance without human data.
Self-Play Reinforcement Learning
AlphaZero learns entirely from self-play, starting from randomly initialized network parameters. It treats the game as a Markov Decision Process, where the agent (the current player) interacts with the environment (the game board) by selecting actions (moves). Over millions of self-play games, it updates its neural network parameters by gradient descent so that the policy output matches the MCTS search probabilities and the value output matches the final game outcome. It does not apply a policy-gradient method such as REINFORCE to the raw network policy. This creates a closed learning loop where the agent continuously improves by playing against progressively stronger versions of itself, without any prior human knowledge or labeled datasets.
Monte Carlo Tree Search (MCTS) Planner
AlphaZero uses a heuristic search algorithm to evaluate positions and select moves during both training and play. The MCTS builds a search tree incrementally through four phases:
- Selection: Traverse the tree from the root using the PUCT (Polynomial Upper Confidence Trees) formula, balancing explored nodes with high value (exploitation) and less-visited nodes (exploration).
- Expansion: Add a new child node to the tree when reaching a leaf node.
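- Evaluation (in place of a rollout): AlphaZero does not play the game out to a terminal state. The leaf position is evaluated once by the neural network: the value head returns an estimate v of the outcome, and the policy head returns prior probabilities for the leaf's children.
- Backpropagation: Update the visit count and value estimate of all nodes along the traversed path with the leaf's value estimate v, or with the actual outcome (+1 for win, -1 for loss, 0 for draw) if the leaf is a terminal position.
After the search, the visit counts at the root form an improved policy target for training, because the search looks further ahead than the raw network output.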
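The selection and backpropagation steps can be sketched in a few lines of Python. This is a minimal illustration, not DeepMind's implementation: the `Node` class, the `c_puct` default of 1.5, and the convention that each node stores values from the perspective of the player who moved into it are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    prior: float                 # P(s, a): prior probability from the policy head
    visit_count: int = 0         # N(s, a)
    value_sum: float = 0.0       # W(s, a), seen by the player who moved into this node
    children: Dict[str, "Node"] = field(default_factory=dict)

    def q(self) -> float:
        """Mean value Q(s, a); 0.0 for an unvisited node."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct_score(parent: Node, child: Node, c_puct: float = 1.5) -> float:
    """Q plus an exploration bonus that grows with the prior and shrinks with visits."""
    exploration = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q() + exploration


def select_child(parent: Node, c_puct: float = 1.5) -> Tuple[str, Node]:
    """Return the (move, child) pair with the highest PUCT score."""
    return max(parent.children.items(), key=lambda kv: puct_score(parent, kv[1], c_puct))


def backup(path: List[Node], leaf_value: float) -> None:
    """Propagate a leaf value (from the leaf's player-to-move view) up the path.

    The sign flips at every level because the two players alternate.
    """
    value = -leaf_value  # the player who moved into the leaf sees the opposite sign
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        value = -value


# Hypothetical two-move position: "a" is well explored, "b" is unvisited.
root = Node(prior=1.0, visit_count=4)
root.children = {"a": Node(0.6, 3, 0.3), "b": Node(0.4)}
move, child = select_child(root)      # picks "b": its exploration bonus dominates
backup([root, child], leaf_value=0.5)
```

The unvisited move wins the selection here even though its prior is lower, because the `1 + N` denominator makes the exploration term large for rarely visited children.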
Unified Deep Neural Network
A single, deep convolutional neural network with a dual-headed architecture serves as AlphaZero's core evaluator and guide.
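- Input: A stack of planar feature maps representing the current board state and recent history. The number of planes depends on the game: the published configurations use 17 planes for Go, 119 for chess, and 362 for shogi.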
- Body: A residual tower of convolutional blocks (using batch normalization and ReLU activations) that extracts spatial features.
- Policy Head (p): Outputs a probability distribution over all legal moves, guiding the MCTS search.
- Value Head (v): Outputs a scalar estimate of the expected game outcome from the current player's perspective (win/loss/draw).
The network is trained to minimize a combined loss function:
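l = (z - v)^2 - π^T log p + c ||θ||^2
where z is the actual game outcome, π is the improved search-based policy, p and v are the network's policy and value outputs, θ denotes the network parameters, and c is an L2 regularization constant. This unified model replaces the separate policy and value networks used in the original AlphaGo; its direct predecessor, AlphaGo Zero, already used a single dual-headed network.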
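To make the three terms of the loss concrete, here is a small sketch that evaluates it for a single training position using only the standard library. The function name `alphazero_loss` and the default `c=1e-4` are assumptions for illustration; a real implementation would compute this over batches in a deep learning framework.

```python
import math
from typing import Sequence


def alphazero_loss(z: float, v: float,
                   pi: Sequence[float], p: Sequence[float],
                   theta: Sequence[float], c: float = 1e-4) -> float:
    """Combined loss for one position: value error + policy cross-entropy + L2 penalty."""
    value_loss = (z - v) ** 2
    # Cross-entropy -pi^T log p; terms with pi_i == 0 contribute nothing.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


# Hypothetical position: the game was won (z = 1), the network predicted v = 0.5,
# the search put all its visits on move 0, and the network was undecided.
loss = alphazero_loss(z=1.0, v=0.5, pi=[1.0, 0.0], p=[0.5, 0.5], theta=[1.0, 2.0], c=0.1)
```

In this example the value term is 0.25, the policy term is ln 2, and the regularization term is 0.5.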
General Game-Playing Architecture
AlphaZero's architecture is game-agnostic. The same algorithm, network topology, and hyperparameters (with minor tuning) mastered chess, shogi, and Go. Only the game rules are provided as input to define:
- The legal move generator.
- The game termination condition.
- The outcome scalar (win/loss/draw).
The neural network's input planes are configured to represent game-specific state information (e.g., piece types and repetition counters for chess). This demonstrates a general-purpose reinforcement learning and planning framework applicable to any deterministic, perfect-information, two-player zero-sum game with a discrete action space.
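One way to picture this separation between the game and the algorithm is an abstract interface that exposes only the three items listed above, plus a way to apply a move. The `Game` class and the toy `Nim` game below are hypothetical and exist only to illustrate the idea; they are not taken from AlphaZero's code.

```python
from abc import ABC, abstractmethod
from typing import List


class Game(ABC):
    """Everything a search-and-learn loop would need to know about a game."""

    @abstractmethod
    def legal_moves(self) -> List[int]:
        """Moves available to the player to move."""

    @abstractmethod
    def apply(self, move: int) -> "Game":
        """Return the successor state; the other player is then to move."""

    @abstractmethod
    def is_terminal(self) -> bool:
        """True once the game is over."""

    @abstractmethod
    def outcome(self) -> int:
        """+1 / -1 / 0 from the perspective of the player to move (terminal states only)."""


class Nim(Game):
    """Toy game: players alternately take 1 or 2 stones; taking the last stone wins."""

    def __init__(self, stones: int) -> None:
        self.stones = stones

    def legal_moves(self) -> List[int]:
        return [n for n in (1, 2) if n <= self.stones]

    def apply(self, move: int) -> "Nim":
        if move not in self.legal_moves():
            raise ValueError(f"illegal move: {move}")
        return Nim(self.stones - move)

    def is_terminal(self) -> bool:
        return self.stones == 0

    def outcome(self) -> int:
        if not self.is_terminal():
            raise ValueError("outcome is only defined for terminal states")
        return -1  # the opponent took the last stone, so the player to move has lost


state = Nim(3).apply(2).apply(1)   # both players move once; no stones remain
```

Any game implementing such an interface could, in principle, be plugged into the same search and training code, which is the sense in which the architecture is game-agnostic.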
Asynchronous Distributed Training
Training was accelerated using a large, distributed computing infrastructure. Self-play games were generated on thousands of first-generation TPUs (Tensor Processing Units), while a much smaller pool of second-generation TPUs trained the network.
- Self-Play Actors: Many parallel actors played games, using the latest network checkpoint to generate training data (state, MCTS-improved policy π, outcome z).
- Training Learner: A single learner process continuously sampled batches of experience from a replay buffer, computed gradients, and updated the neural network weights.
- Checkpointing: Updated network parameters were periodically pushed to the self-play actors.
This asynchronous pipeline enabled the generation of hundreds of millions of self-play games and efficient, stable training over several days.
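The hand-off between actors and learner can be illustrated with a bounded replay buffer: actors append (state, π, z) examples, old examples fall out as new ones arrive, and the learner draws random batches. The `ReplayBuffer` class, its capacity, and the string placeholders for states are assumptions made for this single-process sketch; the real system was distributed.

```python
import random
from collections import deque
from typing import Iterable, List, Optional, Sequence, Tuple

Example = Tuple[str, Sequence[float], int]   # (state, search policy pi, outcome z)


class ReplayBuffer:
    """Fixed-capacity store of training examples; the oldest are evicted first."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._data: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add_game(self, examples: Iterable[Example]) -> None:
        """Called by a self-play actor after a finished game."""
        self._data.extend(examples)

    def sample(self, batch_size: int) -> List[Example]:
        """Called by the learner; draws without replacement from stored examples."""
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self) -> int:
        return len(self._data)


buffer = ReplayBuffer(capacity=3, seed=0)
buffer.add_game([("s0", [1.0], 1), ("s1", [1.0], -1)])
buffer.add_game([("s2", [1.0], 1), ("s3", [1.0], -1)])   # "s0" is evicted
batch = buffer.sample(2)
```

Bounding the buffer keeps the learner training on games from recent network versions rather than on stale data from much weaker play.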
Sample Efficiency & Computational Cost
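Despite its ultimate superhuman performance, AlphaZero is not sample-efficient when measured in games played: it needs tens of millions of self-play games per domain. Its cost is concentrated in training, while play at evaluation time is comparatively cheap.
- Self-Play Games: The chess run used about 44 million self-play games over roughly 9 hours (700,000 training steps), and the reported results show it overtaking Stockfish after about 4 hours of that run. Go and shogi were trained in separate runs with their own game counts and durations.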
- No Domain Knowledge: It achieved this without handcrafted features, opening books, or endgame tables, learning complex concepts like king safety and pawn structure from scratch.
- Compute Scale: The original preprint reports about 5,000 first-generation TPUs for generating self-play games and 64 second-generation TPUs for training the network. However, the final evaluated model ran its MCTS on a single machine with 4 TPUs, showing that the computational cost is concentrated in training, not inference.
How AlphaZero Works: The Self-Play Training Loop
AlphaZero's core innovation is a self-contained reinforcement learning loop where the algorithm plays millions of games against itself to iteratively improve its neural network and search policy.
AlphaZero is a model-based reinforcement learning algorithm that masters games through self-play. It combines a deep neural network—which outputs both a move policy and a state value—with a Monte Carlo Tree Search planner. The network is trained exclusively on data generated from games it plays against itself, starting from random play. This creates a closed-loop system where improved network predictions lead to better game play, which in turn generates higher-quality training data.
The training loop runs continuously, with self-play and training proceeding in parallel. The latest network plays many games via MCTS, where the search is guided by the network's own predictions. The resulting game records, containing board states, MCTS search probabilities, and final outcomes, form the training data. The network is then trained in a supervised fashion on this data to mimic the MCTS-improved policy and predict the game winner. Unlike AlphaGo Zero, which evaluated each new network against the previous best and promoted it only if it won by a set margin, AlphaZero maintains a single network that is updated continually, and self-play always uses the latest parameters. This cycle of self-play, data generation, and policy iteration continues until superhuman performance is achieved.
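A small detail of turning a finished game into training data is that the outcome must be expressed from the perspective of whoever was to move in each recorded position. The sketch below shows one way to do that; the `label_game` helper and the encoding of players as +1 and -1 are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Step = Tuple[str, Sequence[float], int]      # (state, search policy pi, player to move)
Example = Tuple[str, Sequence[float], int]   # (state, search policy pi, outcome z)


def label_game(history: List[Step], winner: int) -> List[Example]:
    """Attach the final outcome z to every position of one self-play game.

    Players are encoded as +1 and -1; winner is +1, -1, or 0 for a draw.
    z is +1 where the player to move went on to win, -1 where they lost.
    """
    return [(state, pi, winner * player) for state, pi, player in history]


# Hypothetical three-move game won by the second player (-1).
history = [("s0", [0.5, 0.5], 1), ("s1", [1.0, 0.0], -1), ("s2", [0.0, 1.0], 1)]
examples = label_game(history, winner=-1)
```

Here the positions where the first player was to move receive z = -1 and the position where the second player was to move receives z = +1.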
Frequently Asked Questions
AlphaZero is a landmark reinforcement learning algorithm developed by DeepMind that achieved superhuman performance in chess, shogi, and Go through self-play alone. This FAQ addresses its core mechanisms, technical innovations, and its broader impact on AI research.
AlphaZero is a reinforcement learning algorithm that masters complex board games through self-play, using a combination of a deep neural network and Monte Carlo Tree Search (MCTS). Starting with only the basic rules of a game, it plays millions of games against itself. A single neural network, with separate policy and value heads, guides the MCTS. The network predicts the probability distribution over moves (the policy) and the expected game outcome from a given position (the value). MCTS uses these predictions to perform a highly focused, look-ahead search, simulating games from the current state. The algorithm then updates the neural network's weights to make its predictions match the improved search results, creating a virtuous cycle of learning.
Related Terms
AlphaZero's breakthrough performance is built upon a synthesis of advanced concepts from reinforcement learning, search, and neural network architecture. Understanding these related terms is essential for engineers and researchers aiming to replicate or extend its capabilities.
Monte Carlo Tree Search (MCTS)
Monte Carlo Tree Search is the core planning algorithm used by AlphaZero. It is a heuristic search algorithm for optimal decision-making in sequential problems. MCTS builds a search tree incrementally by performing four key steps in a loop:
- Selection: Traverse the tree from the root using a selection policy (like UCT) to pick a child node.
- Expansion: Add one or more child nodes to the tree, expanding the frontier.
- Simulation (Rollout): Play out a simulated game from the new node(s) to a terminal state using a fast, default policy.
- Backpropagation: Update the statistics (visit count, value estimate) of all nodes along the traversed path with the simulation's result.
AlphaZero's change to this scheme is to drop the rollout entirely: each new leaf is evaluated by the neural network's value output, and the policy output supplies prior probabilities that direct the search, making each simulation far more informative than a random playout.
Upper Confidence Bound for Trees (UCT)
Upper Confidence Bound for Trees is the canonical selection formula used within MCTS to balance exploration and exploitation. For a given node, it selects the child j that maximizes:
UCT(j) = Q(j) + c * sqrt( ln(N(parent)) / N(j) )
Where:
- Q(j) is the average reward (exploitation term).
- N(j) is the visit count for child j.
- N(parent) is the visit count for the parent node.
- c is an exploration constant.
This formula ensures that promising nodes (high Q) are exploited, while less-visited nodes (low N) are explored. AlphaZero uses a variant of UCT where the prior probability from its neural network's policy head guides the exploration, heavily biasing the search toward moves the network considers strong.
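The UCT formula translates directly into code. In this sketch the function name `uct_score`, the default `c = sqrt(2)`, and the choice to give unvisited children an infinite score (so each is tried once) are assumptions, though all three are common conventions.

```python
import math


def uct_score(q: float, n_child: int, n_parent: int, c: float = math.sqrt(2)) -> float:
    """UCT(j) = Q(j) + c * sqrt(ln(N(parent)) / N(j))."""
    if n_child == 0:
        return math.inf          # unvisited children are tried first
    return q + c * math.sqrt(math.log(n_parent) / n_child)


# Two children with the same average reward but different visit counts.
rarely_visited = uct_score(q=0.5, n_child=2, n_parent=10)
often_visited = uct_score(q=0.5, n_child=8, n_parent=10)
```

With equal average reward, the rarely visited child scores higher, which is the exploration behaviour the formula is designed to produce.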
Self-Play Reinforcement Learning
Self-play is the training paradigm where an AI agent learns by competing against versions of itself, without any human data or intervention. AlphaZero's training loop consists of:
- Iterative Improvement: The latest neural network plays thousands of games against itself, with both sides using the same network and MCTS.
- Data Generation: Each game produces a sequence of board positions, each labeled with the eventual game outcome (win/loss/draw) and the search-derived policy (visit counts from MCTS).
- Network Training: The neural network is trained via supervised learning to predict both the game outcome (value) and the search probabilities (policy) for each position.
- Iteration: The updated network is used for subsequent self-play games.
This creates a positive feedback loop, where the agent's policy and value estimates improve, which in turn improves the quality of the MCTS search, generating better training data.
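The search-derived policy mentioned under Data Generation is obtained by normalizing the root's visit counts. A minimal sketch follows, assuming a hypothetical `visits_to_policy` helper with a temperature parameter: a temperature of 1 gives probabilities proportional to visits, and a temperature of 0 is treated here as picking the most-visited move.

```python
from typing import List, Sequence


def visits_to_policy(visit_counts: Sequence[int], temperature: float = 1.0) -> List[float]:
    """Convert MCTS root visit counts into a probability distribution over moves."""
    if sum(visit_counts) == 0:
        raise ValueError("at least one move must have been visited")
    if temperature == 0:
        # Greedy limit: all probability on the most-visited move.
        best = max(range(len(visit_counts)), key=lambda i: visit_counts[i])
        return [1.0 if i == best else 0.0 for i in range(len(visit_counts))]
    weights = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical root with three moves visited 10, 30 and 60 times.
pi = visits_to_policy([10, 30, 60])
greedy = visits_to_policy([10, 30, 60], temperature=0)
```

The resulting distribution is what the policy head is trained to predict for that position.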
Residual Neural Network (ResNet)
AlphaZero uses a deep Residual Neural Network as its core architecture to process the game board. The ResNet is critical for its ability to train very deep networks effectively. Key features include:
- Residual Blocks: These blocks use skip connections that allow gradients to flow directly through the network, mitigating the vanishing gradient problem and enabling the training of networks with 40+ layers.
- Dual-Headed Output: The network has two output heads:
  - A policy head that outputs a probability distribution over all possible moves.
  - A value head that outputs a scalar estimating the expected game outcome from the current position, on a scale from -1 (loss) to +1 (win).
- Input Representation: The network takes a stack of several recent board positions as input, providing a sense of game history.
This architecture allows AlphaZero to evaluate positions with both strategic understanding (value) and tactical precision (policy).
Model-Based Reinforcement Learning
AlphaZero is commonly classed as a model-based reinforcement learning system. Unlike model-free RL (e.g., Q-learning), which learns a policy or value function directly from experience, a model-based agent plans with a model of the environment. In AlphaZero that model is not learned: it is the rules of the game, supplied as a perfect simulator.
- Known Model: The rules give exact state transitions and terminal outcomes. The neural network does not model the game's dynamics; its policy head proposes promising actions and its value head estimates outcomes for the positions the search reaches.
- Planning with the Model: MCTS uses the rules to perform lookahead search, simulating many possible future game trajectories to evaluate the current position.
This combination lets the agent reason about consequences without experiencing every possible state directly, although training still consumes tens of millions of self-play games.
AlphaGo & AlphaGo Zero
AlphaZero is the direct successor to AlphaGo and AlphaGo Zero, representing the evolution of DeepMind's game-playing AI.
- AlphaGo (2016): Combined MCTS with supervised learning on human expert games and policy/value networks. It defeated world champion Lee Sedol.
- AlphaGo Zero (2017): Removed the dependency on human data, learning solely through self-play. It used a single neural network and defeated the original AlphaGo 100-0.
- AlphaZero (2017): Generalized the AlphaGo Zero algorithm to master multiple games—Chess, Shogi, and Go—using the same network architecture and hyperparameters (except input representation). It demonstrated the algorithm's domain-agnostic nature, achieving superhuman performance in all three games within 24 hours of training, starting from random play.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.