A Markov Decision Process (MDP) is a discrete-time stochastic control process that provides a formal model for decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is defined by a tuple (S, A, P, R, γ), where S is a set of states, A is a set of actions, P(s′ | s, a) gives the probability of transitioning to state s′ after taking action a in state s, R is a reward function, and γ ∈ [0, 1) is a discount factor that weights future rewards. The core Markov property ensures that the distribution over the next state and reward depends only on the current state and action, not on the full history of the process.
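As a minimal sketch of how the tuple (S, A, P, R, γ) can be represented and solved, the hypothetical two-state, two-action example below (all transition probabilities and rewards are illustrative, not from the text) runs value iteration, whose Bellman backup relies exactly on the Markov property: each update looks only at the current state and action.

```python
# Hypothetical MDP: 2 states, 2 actions. P[s][a] is a list of
# (next_state, probability) pairs; R[s][a] is the expected immediate
# reward for taking action a in state s. All numbers are illustrative.
S = [0, 1]
A = [0, 1]
gamma = 0.9  # discount factor, gamma in [0, 1)

P = {
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.4), (1, 0.6)], 1: [(1, 1.0)]},
}
R = {
    0: {0: 0.0, 1: 1.0},
    1: {0: 5.0, 1: 0.0},
}

def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """Compute optimal state values by repeated Bellman backups."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # The Markov property: the backup for state s needs only
            # (s, a), never the history of how we reached s.
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once values have converged
            return V

V = value_iteration(S, A, P, R, gamma)
```

The returned values satisfy the Bellman optimality equation, V(s) = max_a [R(s, a) + γ Σ_s′ P(s′ | s, a) V(s′)], to within the tolerance.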
