The Bellman equation is a recursive decomposition that defines the value of a state or state-action pair as the sum of the immediate reward and the discounted value of the successor state. Formulated by Richard Bellman, it expresses the principle of optimality: an optimal policy has the property that whatever the initial state and decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This recursion enables the computation of value functions through iterative methods like dynamic programming and temporal difference learning.
Bellman Equation

What is the Bellman Equation?
The Bellman equation is the foundational recursive formula for optimal decision-making in sequential problems, central to reinforcement learning and dynamic programming.
In reinforcement learning, the equation provides the update rule for algorithms like Q-learning and policy iteration. The discount factor (gamma) balances immediate versus future rewards, while the expectation accounts for environmental stochasticity. Solving the Bellman optimality equation yields the optimal value function, and acting greedily with respect to that function gives an optimal policy. This makes the equation the theoretical cornerstone for credit assignment and long-term planning in autonomous agents. Its recursive nature directly enables the feedback loops that allow agents to evaluate and correct their future action sequences.
Key Forms of the Bellman Equation
The Bellman equation is not a single formula but a family of recursive relationships that decompose the long-term value of a state or state-action pair. These forms are foundational to different classes of reinforcement learning algorithms.
Bellman Expectation Equation
This form calculates the expected value of following a specific policy π. It expresses the value of a state V(s) as the immediate reward plus the discounted expected value of the next state.
- Formula for State-Value Function: V^π(s) = E_π[ R_{t+1} + γV^π(S_{t+1}) | S_t = s ]
- Formula for Action-Value Function: Q^π(s, a) = E_π[ R_{t+1} + γQ^π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]
- Purpose: Used for policy evaluation, answering "How good is it to be in this state if I follow policy π?" It is the cornerstone of Dynamic Programming methods like Policy Iteration.
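As a rough illustration of policy evaluation, the sketch below applies the expectation backup repeatedly until the values stop changing. The two-state MDP, its action names, the uniform policy `pi`, and the helper `evaluate_policy` are all hypothetical and chosen only to keep the arithmetic small.

```python
# Hypothetical 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
# Assumed policy: pick each action with probability 0.5 in every state.
pi = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 0.5, "go": 0.5}}
gamma = 0.9


def evaluate_policy(P, pi, gamma, tol=1e-10):
    """Iterate the Bellman expectation backup until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Expectation over actions (policy) and over successor states (dynamics).
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


V_pi = evaluate_policy(P, pi, gamma)
print(V_pi)
```

The returned values satisfy the expectation equation up to the tolerance: plugging `V_pi` back into the right-hand side reproduces `V_pi`.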
Bellman Optimality Equation
This form defines the optimal value functions, V*(s) and Q*(s,a). It assumes the agent selects the action that maximizes future returns at every step.
- Formula for Optimal State-Value: V*(s) = max_a E[ R_{t+1} + γV*(S_{t+1}) | S_t = s, A_t = a ]
- Formula for Optimal Action-Value: Q*(s, a) = E[ R_{t+1} + γ max_{a'} Q*(S_{t+1}, a') | S_t = s, A_t = a ]
- Purpose: Provides a recursive definition of optimality. Solving these equations yields an optimal policy. It is the foundation for value-based methods like Q-Learning and Value Iteration.
Bellman Equation for Q-Learning
This is the specific update rule derived from the Bellman Optimality Equation for the Q-function. It is the core of the model-free Q-Learning algorithm.
- Update Rule: Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]
- Key Components:
- Temporal Difference (TD) Error: The term in brackets, [R + γ maxQ - Q], is the error between the current estimate and the new target.
- Learning Rate (α): Controls how much the new estimate overrides the old one.
- Discount Factor (γ): Weights the importance of future rewards.
- Property: It is an off-policy update, learning the optimal Q* while following a different exploration policy (e.g., ε-greedy).
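To make the update rule concrete, here is a minimal sketch of a single tabular Q-learning step. The function name `q_learning_update`, the state and action labels, and the `done` flag (which zeroes the bootstrap term at terminal states) are assumptions added for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one Q-learning update in place and return the TD error."""
    # Bootstrap from the best next action; no bootstrap after a terminal step.
    best_next = 0.0 if done else max(Q[s_next].values())
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical Q-table with two states and two actions.
Q = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 2.0, "right": 4.0}}
delta = q_learning_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
print(delta, Q["s0"]["right"])
```

With these numbers the target is 1 + 0.9 × 4 = 4.6, so the TD error is about 3.6 and the entry moves halfway from 1.0 toward the target, to about 2.8.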
Bellman Equation for Policy Evaluation (TD(0))
This form is used to evaluate a policy π by iteratively updating the state-value function V(s) towards the expected return. It's the basis for the TD(0) algorithm.
- Update Rule: V(S_t) ← V(S_t) + α [ R_{t+1} + γV(S_{t+1}) - V(S_t) ]
- Key Difference from Q-Learning: The target uses V(S_{t+1}) (the value of the next state under the current policy) instead of max_a Q(...). It does not search for the maximum.
- Purpose: Pure policy evaluation. It answers "What is the value of each state if I always follow policy π?" The same on-policy bootstrapping appears in SARSA (applied to Q rather than V) and in Actor-Critic methods, where the critic evaluates the actor's current policy.
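A small sketch of TD(0) policy evaluation follows, assuming a hypothetical deterministic chain 0 → 1 → 2 in which state 2 is terminal and only the last transition pays a reward of 1. The helper name `td0_update` and the `terminal` flag are illustrative.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal):
    """Move V[s] a fraction alpha toward the one-step TD target."""
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])


# One fixed episode under the evaluated policy: (state, reward, next_state, terminal).
episode = [(0, 0.0, 1, False), (1, 1.0, 2, True)]
V = {0: 0.0, 1: 0.0, 2: 0.0}

for _ in range(500):
    for s, r, s_next, terminal in episode:
        td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=terminal)

print(V)
```

Under these assumptions the true values are V(1) = 1 and V(0) = 0.9, and the estimates approach them as the episode is replayed.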
Advantage Function & Bellman Equation
The Advantage Function A^π(s, a) = Q^π(s, a) - V^π(s) measures how much better a specific action is compared to the average action under policy π. Its Bellman equation is derived from the standard forms.
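- Advantage in Bellman Form: A^π(s, a) = E[ R_{t+1} + γV^π(S_{t+1}) | S_t=s, A_t=a ] - V^π(s). The advantage is the expected one-step TD error. It has no standalone recursion in A^π alone, because the expected advantage of the next action under π is zero.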
- Interpretation: It effectively subtracts the state-value baseline V(s) from the action-value Q(s,a), reducing variance in policy gradient updates.
- Primary Use: Central to modern Actor-Critic and Policy Gradient algorithms like Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C). The critic often learns V(s), and the advantage is estimated to update the actor's policy.
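The definition can be checked numerically for a single state. In the sketch below the Q-values and policy probabilities are made up, and `td_advantage` is a hypothetical helper showing the one-sample estimate r + γV(s') - V(s) that a critic-based method might use.

```python
# Hypothetical action values and policy probabilities for one state s.
Q_s = {"left": 1.0, "right": 3.0}
pi_s = {"left": 0.25, "right": 0.75}

# V^pi(s) is the policy-weighted average of Q^pi(s, a).
V_s = sum(pi_s[a] * Q_s[a] for a in Q_s)

# A^pi(s, a) = Q^pi(s, a) - V^pi(s).
A_s = {a: Q_s[a] - V_s for a in Q_s}

# Averaged under the policy, the advantages cancel out.
expected_advantage = sum(pi_s[a] * A_s[a] for a in A_s)


def td_advantage(r, v_s, v_next, gamma):
    """One-sample advantage estimate from a learned state-value function."""
    return r + gamma * v_next - v_s


print(V_s, A_s, expected_advantage)
print(td_advantage(r=1.0, v_s=2.5, v_next=2.0, gamma=0.5))
```

Here V(s) = 2.5, the advantages are -1.5 and 0.5, and their policy-weighted average is zero, which is why subtracting V(s) acts as a baseline.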
Model-Based Bellman Equation
When an agent learns or is given an explicit model of the environment's dynamics (transition function T(s'|s,a) and reward function R(s,a,s')), the Bellman equation can be used for planning.
- Planning Update: V(s) ← max_a Σ_{s'} T(s'|s,a) [ R(s,a,s') + γV(s') ]
- Key Difference: Instead of sampling experiences (S_t, A_t, R_{t+1}, S_{t+1}), the agent uses its internal model to simulate outcomes and compute expected values.
- Algorithms: Forms the basis for Model-Based Reinforcement Learning and planning algorithms like Value Iteration when the model is known. It allows the agent to "think ahead" by simulating trajectories without interacting with the real environment, improving sample efficiency.
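A single planning backup can be sketched as follows, assuming the model is stored in plain dictionaries. The states, actions, and numbers in `T`, `R`, and `V` are hypothetical, and `planning_backup` is an illustrative name.

```python
gamma = 0.9

# Hypothetical model: T[(s, a)] maps next states to probabilities,
# R[(s, a, s_next)] is the reward for that transition.
T = {
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s0", "b"): {"s1": 1.0},
}
R = {
    ("s0", "a", "s0"): 1.0,
    ("s0", "a", "s1"): 1.0,
    ("s0", "b", "s1"): 0.0,
}
V = {"s0": 0.0, "s1": 10.0}  # current value estimates


def planning_backup(s, actions):
    """Return max_a sum_{s'} T(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    return max(
        sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
        for a in actions
    )


new_value = planning_backup("s0", ["a", "b"])
print(new_value)
```

Action "a" is worth 0.5 × 1 + 0.5 × (1 + 9) = 5.5 and action "b" is worth 9, so the backup returns 9 without any interaction with a real environment.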
Bellman Equation vs. Related Concepts
This table compares the Bellman Equation, the foundational recursive formula for value functions in reinforcement learning, against other core algorithmic concepts that either derive from it, compete with it, or are used in conjunction with it.
| Concept / Feature | Bellman Equation | Monte Carlo Methods | Temporal Difference (TD) Learning |
|---|---|---|---|
| Core Principle | Recursive decomposition of value | Averaging returns from complete episodes | Bootstrapping from current value estimates |
| Update Target | Expected value of next state + reward | Actual empirical return (G_t) | TD target: reward + γ * V(s') |
| Model Requirement | Can be model-based or model-free | Strictly model-free | Strictly model-free |
| Bias/Variance Tradeoff | Low variance, potential bias | Zero bias, high variance | Balanced bias and variance |
| Update Timing | Can be per-step (TD) or per-episode | Only after episode termination | Per-step (online) |
| Sample Efficiency | High (leverages bootstrapping) | Low (requires full episodes) | High (updates immediately) |
| Primary Use Case | Dynamic programming, value iteration, Q-learning | Policy evaluation in episodic tasks | Online learning in continuing tasks |
| Connection to Bellman | Is the defining equation | Converges to Bellman equation solution | Directly implements Bellman equation via bootstrapping |
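The difference between the two update targets in the table can be seen on a short, made-up reward sequence. The helper names `discounted_return` and `td_target` and the value estimate 0.4 are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Monte Carlo target: the full discounted return G_t of a finished episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(r, v_next, gamma):
    """TD target: one real reward plus the discounted current estimate of the next state."""
    return r + gamma * v_next


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards observed until termination
mc = discounted_return(rewards, gamma=0.5)
td = td_target(r=rewards[0], v_next=0.4, gamma=0.5)
print(mc, td)
```

The Monte Carlo target is 1 + 0.5 × 0 + 0.25 × 2 = 1.5 and needs the whole episode, while the TD target is 1 + 0.5 × 0.4 = 1.2 and is available after one step, at the cost of depending on the estimate 0.4.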
Algorithms Using the Bellman Equation
The Bellman equation provides the foundational recursive relationship for value functions, enabling a suite of algorithms that learn optimal policies through iterative estimation and improvement.
Value Iteration
A dynamic programming algorithm that directly solves the Bellman optimality equation. It iteratively computes the optimal value function V*(s) for all states until convergence.
- Process: Starts with arbitrary value estimates, then repeatedly applies the Bellman backup operator: V_{k+1}(s) = max_a [ R(s,a) + γ Σ_s' P(s'|s,a) V_k(s') ].
- Output: The converged value function implicitly defines the optimal policy: π*(s) = argmax_a [ R(s,a) + γ Σ_s' P(s'|s,a) V*(s') ].
- Use Case: Requires a perfect model of the environment's dynamics (transition probabilities P and reward function R). Used for planning when the model is known.
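The process and output steps above can be sketched for a tiny known model. The two-state MDP stored in `R` and `P`, the action names, and the helper `q_value` are hypothetical.

```python
gamma = 0.9

# Hypothetical known model: R[s][a] is the reward, P[s][a] maps next states to probabilities.
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 2.0, "go": 0.0}}
P = {
    0: {"stay": {0: 1.0}, "go": {1: 1.0}},
    1: {"stay": {1: 1.0}, "go": {0: 1.0}},
}


def q_value(V, s, a):
    """R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())


def value_iteration(tol=1e-10):
    V = {s: 0.0 for s in R}  # arbitrary starting estimates
    while True:
        # Bellman optimality backup applied to every state.
        V_new = {s: max(q_value(V, s, a) for a in R[s]) for s in R}
        if max(abs(V_new[s] - V[s]) for s in R) < tol:
            return V_new
        V = V_new


V_star = value_iteration()
# Greedy policy extracted from the converged values.
pi_star = {s: max(R[s], key=lambda a: q_value(V_star, s, a)) for s in R}
print(V_star, pi_star)
```

In this example staying in state 1 earns 2 per step, so V*(1) = 2 / (1 - 0.9) = 20 and V*(0) = 1 + 0.9 × 20 = 19, with the greedy policy moving from state 0 to state 1 and then staying.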
Policy Iteration
An alternative dynamic programming method that alternates between two distinct phases: policy evaluation and policy improvement, leveraging the Bellman expectation equation.
- Policy Evaluation: Given a policy π, compute its value function V^π by solving the linear Bellman equation (or iterating until convergence).
- Policy Improvement: Greedily improve the policy based on the newly computed value function: π'(s) = argmax_a [ R(s,a) + γ Σ_s' P(s'|s,a) V^π(s') ].
- Guarantee: Each iteration produces a strictly better policy (unless it is already optimal). It often converges in fewer iterations than value iteration.
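The two alternating phases can be sketched as below, assuming deterministic policies and a small hypothetical model. Evaluation here iterates the expectation backup rather than solving the linear system directly, and all names are illustrative.

```python
gamma = 0.9

# Hypothetical known model: R[s][a] is the reward, P[s][a] maps next states to probabilities.
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 2.0, "go": 0.0}}
P = {
    0: {"stay": {0: 1.0}, "go": {1: 1.0}},
    1: {"stay": {1: 1.0}, "go": {0: 1.0}},
}


def q_value(V, s, a):
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the expectation backup for a deterministic policy."""
    V = {s: 0.0 for s in R}
    while True:
        V_new = {s: q_value(V, s, policy[s]) for s in R}
        if max(abs(V_new[s] - V[s]) for s in R) < tol:
            return V_new
        V = V_new


def policy_iteration(policy):
    rounds = 0
    while True:
        V = evaluate(policy)
        # Policy improvement: act greedily with respect to V^pi.
        improved = {s: max(R[s], key=lambda a: q_value(V, s, a)) for s in R}
        rounds += 1
        if improved == policy:
            return policy, V, rounds
        policy = improved


pi_star, V_star, rounds = policy_iteration({0: "stay", 1: "go"})
print(pi_star, V_star, rounds)
```

Starting from a deliberately poor policy, the first improvement step already finds the greedy policy and the second round confirms that it is stable.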
Q-Learning
A model-free, off-policy temporal difference (TD) control algorithm. It learns the optimal action-value function Q*(s,a) by applying a sampled version of the Bellman optimality equation.
- Update Rule: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a' Q(s_{t+1}, a') - Q(s_t, a_t) ]. The term in brackets is the TD error.
- Off-Policy: It learns the value of the optimal policy (max over a') while following a different behavior policy (e.g., ε-greedy) for exploration.
- Foundation: The update directly implements a stochastic approximation of the Bellman optimality operator. Proven to converge to Q* under standard conditions.
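A compact sketch of the full learning loop is shown below on a hypothetical four-state chain where reaching the rightmost state pays a reward of 1. To highlight the off-policy property, the behaviour policy here is uniformly random (ε-greedy with ε = 1), which is a simplification chosen for this example; the environment, the `step` function, and the hyperparameters are all assumptions.

```python
import random

N_STATES, GOAL = 4, 3
ACTIONS = (-1, +1)  # move left or right along the chain


def step(s, a):
    """Hypothetical deterministic chain: reward 1 on reaching GOAL, else 0."""
    s2 = min(max(s + a, 0), GOAL)
    done = s2 == GOAL
    return s2, (1.0 if done else 0.0), done


def q_learning(episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)  # behaviour policy: uniformly random
            s2, r, done = step(s, a)
            # Target policy: greedy (max over next actions), regardless of behaviour.
            target = r if done else r + gamma * max(Q[s2].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = q_learning()
greedy = {s: max(Q[s], key=Q[s].get) for s in range(GOAL)}
print(greedy, Q[0])
```

Although the agent never acts greedily while learning, the greedy policy read off the learned table moves right in every state, and Q(0, right) approaches 0.9² = 0.81.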
SARSA (State-Action-Reward-State-Action)
A model-free, on-policy TD control algorithm. It learns the action-value function Q^π(s,a) for the policy π currently being executed, using the Bellman expectation equation.
- Update Rule: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ].
- On-Policy: The update uses the action a_{t+1} that the agent actually takes from the next state, which is dictated by its current policy (e.g., ε-greedy).
- Result: Converges to Q^π for the given policy π. To find the optimal policy, π must eventually become greedy with respect to Q (e.g., via GLIE conditions).
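A single SARSA step can be sketched as follows. The Q-table values and the name `sarsa_update` are hypothetical; the point is that the target uses the next action actually chosen, not the best one.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update in place and return the TD error."""
    # Bootstrap from the action the agent really takes next (on-policy).
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical Q-table; the agent's policy happens to pick "left" in s1.
Q = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 2.0, "right": 4.0}}
delta = sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="left",
                     alpha=0.5, gamma=0.9)
print(delta, Q["s0"]["right"])
```

The target is 1 + 0.9 × 2 = 2.8, giving a TD error of about 1.8 and a new value of about 1.9. A max-based target on the same transition would have used Q(s1, right) = 4 instead.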
Deep Q-Network (DQN)
A seminal algorithm that combines Q-Learning with deep neural networks as function approximators for the Q-function. It stabilizes training using key innovations to approximate the Bellman update.
- Network: A deep neural network parameterizes Q(s,a; θ).
- Key Techniques:
- Experience Replay: Stores transitions (s, a, r, s') in a buffer and samples mini-batches to break temporal correlations.
- Target Network: Uses a separate, slowly updated network to compute the TD target (r + γ max_a' Q(s', a'; θ^-)), preventing divergence.
- Impact: Demonstrated superhuman performance on Atari games, establishing deep reinforcement learning as a viable field.
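The two stabilizing techniques can be sketched without a deep learning library. In the toy example below, plain dictionaries stand in for the online parameters θ and the target parameters θ^-, which is an assumption made purely for illustration; a real DQN would use neural networks and gradient steps. The `ReplayBuffer` class and `td_targets` function are hypothetical names.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal order of the stored transitions.
        return self.rng.sample(list(self.buffer), batch_size)


def td_targets(batch, target_q, gamma):
    """Compute r + gamma * max_a' Q(s', a'; theta^-) using the frozen target table."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else max(target_q[s_next].values())
        targets.append(r + gamma * bootstrap)
    return targets


# Toy stand-ins for the online and target networks.
online_q = {0: {"a": 0.0, "b": 0.0}, 1: {"a": 1.0, "b": 3.0}}
target_q = copy.deepcopy(online_q)  # periodic sync: theta^- <- theta
online_q[1]["b"] = 100.0            # online estimates move; targets stay fixed until the next sync

buf = ReplayBuffer(capacity=100)
buf.push(0, "a", 1.0, 1, False)
buf.push(1, "b", 5.0, 0, True)
targets = td_targets(buf.sample(2), target_q, gamma=0.9)
print(sorted(targets))
```

Because the targets are read from the frozen copy, the large change to the online table does not leak into them: the non-terminal transition gets 1 + 0.9 × 3 = 3.7 and the terminal one gets 5.0.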
Temporal Difference (TD) Learning
A broad class of model-free methods for estimating value functions. TD algorithms learn directly from raw experience by bootstrapping—updating estimates based on other estimates—as formalized by the Bellman equation.
- Core Idea: Instead of waiting for a full episode's return (like Monte Carlo), TD updates the value estimate V(s_t) immediately after transitioning to s_{t+1} and receiving reward r_{t+1}.
- TD(0) Update: V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ]. This is the Bellman expectation equation for V^π, implemented as a sample update.
- Significance: Provides a data-efficient, online learning method that unifies Monte Carlo and dynamic programming ideas.
Frequently Asked Questions
The Bellman equation is the fundamental recursive relationship that defines optimal value in reinforcement learning, forming the core of how agents learn from delayed rewards. These questions address its mechanics, applications, and role in modern AI systems.
The Bellman equation is a recursive decomposition formula that expresses the value of a state (or state-action pair) in a Markov Decision Process (MDP) as the sum of the immediate reward and the discounted value of the successor state. It is the foundational principle of dynamic programming and reinforcement learning (RL), providing a mathematically tractable way to define optimality. For the optimal state-value function V*(s), the Bellman optimality equation is: V*(s) = max_a [ R(s,a) + γ * Σ_s' P(s'|s,a) * V*(s') ], where R is the reward, γ is the discount factor, and P is the transition probability. This recursion allows the value of a complex, multi-step decision problem to be broken down into simpler, one-step subproblems.
Related Terms
The Bellman equation is the cornerstone of value estimation in reinforcement learning. These related concepts define the mechanisms for learning from feedback and planning optimal actions.
Value Function
The value function is the core prediction the Bellman equation decomposes. It estimates the expected cumulative future reward an agent can achieve starting from a given state (or state-action pair).
- State-Value Function (V(s)): Predicts the total reward from a state s under a specific policy.
- Action-Value Function (Q(s,a)): Predicts the total reward from taking action a in state s, then following the policy.
The Bellman equation provides a recursive consistency condition that the optimal value function must satisfy.
Dynamic Programming
Dynamic Programming is the foundational algorithmic paradigm from which the Bellman equation originates. It solves complex problems by:
- Breaking them down into simpler overlapping subproblems.
- Solving each subproblem only once and storing its solution (memoization).
- Combining solutions to solve the overall problem.
In RL, DP algorithms like policy iteration and value iteration use the Bellman equation as an update rule to compute optimal value functions and policies, assuming a perfect model of the environment's dynamics.
Temporal Difference (TD) Learning
Temporal Difference Learning is a class of model-free RL methods that operationalize the Bellman equation through online experience. Instead of requiring a full model, TD methods learn directly from samples of interaction by bootstrapping—updating estimates based on other estimates.
- TD Error: The core signal is the difference between the current value estimate and a better estimate formed from the immediate reward and the value of the next state: δ = R + γV(S') - V(S). This is a direct realization of the Bellman equation's recursive form.
- Algorithms: TD(0), SARSA, and Q-Learning are all built on this principle.
Markov Decision Process (MDP)
A Markov Decision Process is the formal mathematical framework that defines the RL problem the Bellman equation solves. An MDP is characterized by:
- States (S): A set of possible situations.
- Actions (A): A set of possible moves.
- Transition Function (P): The probability P(s'|s,a) of moving to state s' from state s after taking action a.
- Reward Function (R): The expected immediate reward R(s,a,s').
- Discount Factor (γ): As in the Bellman equation.
The Bellman equation is only valid because of the Markov Property, which states that the future depends only on the current state and action, not the full history.
Bootstrapping
Bootstrapping is the learning mechanism inspired by the Bellman equation's recursive nature. It refers to an agent updating its estimates based on other existing estimates, rather than waiting for a complete, final outcome (a Monte Carlo return).
- Mechanism: In the Bellman update V(S) ← R + γV(S'), the value of state S is updated using the current estimate for the value of the next state S'.
- Impact: This allows for online, incremental learning after every time step, but introduces bias because the target depends on an imperfect estimate. It is a key feature distinguishing TD methods from Monte Carlo methods.
Optimality Principle
The Principle of Optimality, articulated by Richard Bellman, is the foundational idea that underlies the Bellman equation. It states:
"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."
- Implication: An optimal long-term strategy can be built from optimal solutions to its subproblems. This recursive structure is exactly what the Bellman optimality equation captures mathematically, allowing the problem of finding a global optimal policy to be solved via local, recursive value updates.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.