Markov Decision Process (MDP) Definition & Examples

MATHEMATICAL FRAMEWORK

Core Components of an MDP

A Markov Decision Process (MDP) is a formal, mathematical framework for modeling sequential decision-making under uncertainty. It is defined by five core components that together specify the rules of interaction between an agent and its environment.

State Space (S)

The State Space is the set of all possible situations or configurations the environment can be in. In an MDP, each state is a complete description of the world at a given time, with no hidden information. The Markov Property states that the distribution over the next state depends only on the current state and action, not on the full history.

Examples: A robot's (x, y) coordinates; the board configuration in chess; the inventory levels in a supply chain.
Key Consideration: The state must be defined to be sufficiently informative for decision-making while being as compact as possible to avoid the curse of dimensionality.

Action Space (A)

The Action Space is the set of all possible moves or decisions the agent can make from a given state. Actions are the agent's mechanism for influencing the environment and transitioning between states.

Discrete vs. Continuous: Actions can be discrete (e.g., {left, right, up, down}) or continuous (e.g., a torque value between -1 and 1).
State-Dependent Actions: The set of available actions A(s) can vary depending on the current state s. For instance, some moves are illegal in certain chess positions.
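
A state-dependent action set A(s) can be sketched as a function that filters a fixed list of moves. The 3x3 grid, the `MOVES` table, and the name `available_actions` are illustrative assumptions, not part of any particular library.

```python
# Hypothetical 3x3 grid; a state is an (x, y) cell with (0, 0) in the top-left corner.
GRID_WIDTH, GRID_HEIGHT = 3, 3
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def available_actions(state):
    """Return A(s): the moves that keep the agent inside the grid."""
    x, y = state
    legal = []
    for name, (dx, dy) in MOVES.items():
        nx, ny = x + dx, y + dy
        if 0 <= nx < GRID_WIDTH and 0 <= ny < GRID_HEIGHT:
            legal.append(name)
    return legal

print(available_actions((0, 0)))  # corner cell: fewer legal moves
print(available_actions((1, 1)))  # center cell: every move is legal
```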

Transition Function (P)

The Transition Function, or dynamics model, defines the probability of moving to a new state given the current state and action. It is formally written as P(s' | s, a), representing the probability of transitioning to state s' after taking action a in state s.

Stochasticity: This function encodes the inherent uncertainty in the environment. Even with the same state and action, multiple outcomes are possible.
Model-Based vs. Model-Free: In model-based RL, the agent learns or is given this function. In model-free RL, the agent learns a policy directly without explicitly modeling P.
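
For a small discrete problem, P(s' | s, a) can be stored as a lookup table and sampled directly. The sketch below assumes a made-up two-state model; the state names, action names, and probabilities are placeholders.

```python
import random

# Hypothetical tabular model: P[(s, a)] maps each next state s' to P(s' | s, a).
P = {
    ("s0", "go"): {"s1": 0.8, "s0": 0.2},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}

def sample_next_state(state, action, rng):
    """Draw s' from the distribution P(. | state, action)."""
    dist = P[(state, action)]
    next_states = list(dist.keys())
    probs = list(dist.values())
    return rng.choices(next_states, weights=probs, k=1)[0]

rng = random.Random(0)
samples = [sample_next_state("s0", "go", rng) for _ in range(1000)]
# The empirical frequency of "s1" should be close to the modeled 0.8.
print(samples.count("s1") / len(samples))
```

The same state and action produce different outcomes across draws, which is the stochasticity described above.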

Reward Function (R)

The Reward Function provides a scalar feedback signal R(s, a, s') to the agent after each transition. It defines the goal of the MDP by quantifying the desirability of moving from state s to s' via action a. The agent's objective is to maximize the cumulative discounted reward over time.

Sparse vs. Dense Rewards: Sparse rewards (e.g., +1 for winning, 0 otherwise) are hard to learn from. Reward shaping adds intermediate rewards to guide learning.
Design Challenge: A poorly designed reward function can lead to reward hacking, where the agent finds unintended ways to maximize reward without solving the intended task.
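
One well-known way to add intermediate rewards is potential-based shaping, where the extra term is γΦ(s') - Φ(s) for some potential function Φ. This specific technique is not named in the text above; it is shown here as one option, with a hypothetical goal cell and a Manhattan-distance potential chosen purely for illustration.

```python
GOAL = (3, 0)  # hypothetical goal cell on a grid

def potential(state):
    """Negative Manhattan distance to the goal: higher means closer."""
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def shaped_reward(base_reward, state, next_state, gamma=0.99):
    """Sparse base reward plus the shaping term gamma * phi(s') - phi(s)."""
    return base_reward + gamma * potential(next_state) - potential(state)

print(shaped_reward(0.0, (0, 0), (1, 0), gamma=1.0))  # step toward the goal
print(shaped_reward(0.0, (1, 0), (0, 0), gamma=1.0))  # step away from the goal
```

Moving toward the goal earns a positive bonus and moving away earns a negative one, so the agent receives a learning signal even when the base reward is zero.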

Discount Factor (γ)

The Discount Factor, gamma (γ), is a number between 0 and 1 that determines the present value of future rewards. It is used to compute the discounted return: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...

Purpose: With γ < 1 and bounded rewards, the infinite sum of rewards is guaranteed to converge. Discounting also lets the agent weigh near-term rewards against distant ones.
Interpretation: A γ close to 1 makes the agent far-sighted (valuing long-term rewards). A γ close to 0 makes it myopic (focusing on immediate rewards). γ can also be interpreted as the probability that the episode continues after each step.
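
The discounted return G_t can be computed for a finite reward sequence by accumulating from the last reward backward. This is a minimal sketch; the function name `discounted_return` is an assumption.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma * R_2 + gamma^2 * R_3 + ... for a finite list of rewards."""
    g = 0.0
    # Working backward applies the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.0))  # myopic: only the first reward counts
print(discounted_return([1.0] * 200, gamma=0.9))      # approaches 1 / (1 - 0.9) = 10
```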

Policy (π) & Value Functions

While not a defining component of the MDP itself, the Policy and Value Functions are the core solutions derived from it.

Policy (π): The agent's strategy, mapping states to actions. It can be deterministic (a = π(s)) or stochastic (a distribution π(a|s) over actions).
Value Functions: Estimate how good a state or state-action pair is.
- State-Value Function V^π(s): Expected return starting from state s and following policy π.
- Action-Value Function Q^π(s, a): Expected return after taking action a in state s and then following π.
These functions satisfy the Bellman Equations, which are recursive consistency conditions central to most RL algorithms.
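
To make V^π and Q^π concrete, the sketch below runs iterative policy evaluation on a made-up two-state MDP. The model `P`, the policy, and the helper names are assumptions for illustration only.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9

# Stochastic policy format pi(a|s); this one happens to be deterministic.
policy = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 1.0, "go": 0.0}}

def q_value(V, s, a):
    """Q^pi(s, a): expected reward plus discounted value of the successor state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate_policy(policy, tol=1e-10):
    """Apply the Bellman expectation backup until V^pi stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(pi_a * q_value(V, s, a) for a, pi_a in policy[s].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

V = evaluate_policy(policy)
print(V)                      # V(1) = 2 / (1 - 0.9) = 20, V(0) = 1 + 0.9 * 20 = 19
print(q_value(V, 0, "stay"))  # value of deviating once in state 0, then following pi
```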

APPLICATIONS

MDP Examples in Robotics & AI

A Markov Decision Process (MDP) provides the foundational mathematical model for sequential decision-making under uncertainty. These examples illustrate how its core components—states, actions, transitions, and rewards—are instantiated in real-world robotic and AI systems.

Autonomous Navigation

An autonomous mobile robot uses an MDP to plan a path from point A to B while avoiding obstacles.

State: The robot's (x, y) coordinates, orientation, and a map of static obstacles.
Action: Commands like 'move forward 0.5m', 'rotate left 15 degrees', or 'stop'.
Transition: The probability the robot successfully moves to the intended cell, accounting for wheel slippage or localization error.
Reward: A small negative reward for each time step (to encourage speed) and a large negative reward for hitting an obstacle. A large positive reward is given upon reaching the goal.
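
The four components above can be sketched as a single step function on a small grid. The grid size, obstacle layout, slip probability, and reward magnitudes are all assumed values, and the slip model (a random direction replaces the command) is one simple choice among many.

```python
import random

WIDTH, HEIGHT = 4, 3          # hypothetical map size
GOAL = (3, 0)
OBSTACLES = {(1, 1)}
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action, rng, slip_prob=0.1):
    """Return (next_state, reward, done) for one move on the grid."""
    # Transition noise: with probability slip_prob a random direction is executed instead.
    if rng.random() < slip_prob:
        action = rng.choice(list(MOVES))
    dx, dy = MOVES[action]
    x, y = state[0] + dx, state[1] + dy
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        return state, -1.0, False      # blocked by the map edge: time-step cost only
    if (x, y) in OBSTACLES:
        return state, -10.0, False     # large penalty for hitting an obstacle
    if (x, y) == GOAL:
        return (x, y), 10.0, True      # large reward on reaching the goal
    return (x, y), -1.0, False         # small per-step penalty encourages speed

rng = random.Random(0)
print(step((0, 0), "right", rng))
```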

Robotic Manipulation & Grasping

A robotic arm learns to pick up a diverse set of objects from a bin. This is a classic Partially Observable MDP (POMDP) problem, as the exact pose and properties of occluded objects are uncertain.

State: The estimated 6D pose of the gripper and perceived point cloud of the workspace.
Action: Joint torque commands or end-effector velocity in (x, y, z, roll, pitch, yaw).
Transition: The complex physics of contact between gripper, object, and bin.
Reward: +1 for a successful lift and hold, -0.1 for a failed grasp, and a large negative reward for excessive force or dropping an object.

Inventory Management Robot

A warehouse robot orchestrates fetching and storing items to optimize throughput. This can be modeled as a Multi-Agent MDP when multiple robots share the space.

State: Location of all robots, inventory levels at pick stations, and a queue of pending orders.
Action: For a single robot: 'go to shelf A12', 'retrieve item X', 'deliver to station 3'.
Transition: Movement success depends on traffic congestion and other agents' actions.
Reward: Based on orders fulfilled per hour. A reward is given for each delivered item, with a penalty for energy use and for collisions (requiring safe RL constraints).

Bipedal Locomotion Control

Teaching a humanoid robot to walk involves a continuous-state, continuous-action MDP with a high-dimensional state space.

State: Joint angles and velocities, torso orientation, center of pressure, and IMU data.
Action: Target torques applied to each joint actuator.
Transition: Governed by rigid-body dynamics and contact forces with the ground.
Reward: A shaped reward function combining:
- Positive reward for forward velocity.
- Negative reward for torso tilt from upright.
- Negative reward for high energy consumption.
- Large negative reward (episode termination) for falling.
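
A shaped reward of this kind is often written as a weighted sum of the individual terms. The weights, the squared-torque energy proxy, and the function name below are assumptions; in practice they are tuned per robot.

```python
def locomotion_reward(forward_velocity, torso_tilt_rad, joint_torques, has_fallen,
                      w_vel=1.0, w_tilt=0.5, w_energy=0.001, fall_penalty=100.0):
    """Return (reward, done) for one control step of a walking robot."""
    if has_fallen:
        # Falling ends the episode with a large negative reward.
        return -fall_penalty, True
    # Sum of squared torques as a simple stand-in for energy consumption.
    energy = sum(t * t for t in joint_torques)
    reward = (w_vel * forward_velocity
              - w_tilt * abs(torso_tilt_rad)
              - w_energy * energy)
    return reward, False

print(locomotion_reward(1.0, 0.0, [0.0, 0.0], has_fallen=False))
print(locomotion_reward(1.0, 0.3, [5.0, -5.0], has_fallen=False))
print(locomotion_reward(0.0, 1.2, [0.0, 0.0], has_fallen=True))
```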

Sim-to-Real Policy Training

Model-Based RL heavily utilizes the MDP formalism, and sim-to-real training does too: the physics simulator acts as a known MDP model. A drone is trained in the simulator to perform acrobatic maneuvers before the policy is transferred to a physical drone.

State: In simulation, the exact pose, velocity, and angular momentum of the drone.
Action: RPMs sent to each of the four rotors.
Transition: The simulator's physics engine provides a deterministic or stochastic model.
Reward: Shaped to perform a flip, with reward for tracking a target orientation trajectory.
Core Challenge: Sim-to-Real transfer. The real-world transition dynamics differ from the simulated MDP, a gap commonly addressed with domain randomization during training.
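
Domain randomization amounts to drawing new simulator parameters for each training episode, so the policy sees a family of MDPs rather than a single one. The parameter names and ranges below are invented for illustration and do not come from any specific simulator.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    """Hypothetical physical parameters of the simulated drone."""
    mass_kg: float
    motor_gain: float
    wind_mps: float

def sample_sim_params(rng):
    """Draw one randomized set of dynamics parameters (ranges are assumptions)."""
    return SimParams(
        mass_kg=rng.uniform(0.9, 1.1),
        motor_gain=rng.uniform(0.8, 1.2),
        wind_mps=rng.uniform(0.0, 3.0),
    )

rng = random.Random(42)
for episode in range(3):
    params = sample_sim_params(rng)
    # A real training loop would reset the simulator with `params` and run the episode here.
    print(episode, params)
```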

Interactive Task Learning

A personal assistive robot learns to perform household tasks like setting a table via human feedback, often framed as an Inverse Reinforcement Learning (IRL) problem over an MDP.

State: The arrangement of plates, cutlery, and glasses on a table and in cabinets.
Action: 'Pick up plate', 'Move to location X', 'Place object'.
Transition: The outcome of manipulation actions.
Reward: Unknown. The robot observes human demonstrations (state-action trajectories assumed to be near-optimal) and infers the latent reward function (e.g., plates go directly in front of chairs, forks on the left). The inferred reward function allows the robot to generalize to new table configurations.

FRAMEWORK COMPARISON

MDP Extensions and Related Models

A comparison of mathematical frameworks that extend or modify the core Markov Decision Process to address specific challenges in sequential decision-making, particularly relevant to robotics and embodied intelligence.

| Core Feature / Assumption | Markov Decision Process (MDP) | Partially Observable MDP (POMDP) | Multi-Agent MDP (MAMDP) | Hierarchical MDP (HMDP) |
| --- | --- | --- | --- | --- |
| State Observability | Full | Partial | | |
| Agent Count | Single | Single | Multiple (N) | Single |
| Core Challenge Addressed | Optimal policy under full information | Decision-making under uncertainty & partial sensing | Inter-agent interaction (cooperation/competition) | Temporal abstraction for long-horizon tasks |
| Typical Solution Approach | Dynamic Programming, RL (Q-learning, Policy Gradients) | Belief state estimation, POMDP-specific planners (PBVI) | Equilibrium concepts (Nash, correlated), Centralized training with decentralized execution | Option frameworks, Skill discovery, Manager-worker policies |
| Primary Use Case in Robotics | Theoretical foundation; planning in fully known, simulated worlds | Real-world navigation & manipulation with noisy sensors (e.g., robot unsure of object location) | Fleet coordination, swarm robotics, human-robot collaboration | Complex task decomposition (e.g., "make coffee" -> navigate, grasp, pour) |
| Computational Complexity | Polynomial (in states & actions) | PSPACE-complete for finite horizons (intractable for large belief spaces) | NEXP-complete for decentralized, partially observable variants (scales poorly with agents) | Depends on hierarchy depth; often uses semi-MDP theory |
| Key Algorithmic Component | Value function V(s) or Q(s,a) | Belief state b(s), Belief updater | Joint action space, Other agents' policies | Options, Macro-actions, Sub-policies |
| Connection to Core MDP | Base model | Generalization (MDP is a POMDP with identity observation function) | Extension of state/action space to multi-agent setting | Extension of action space to include temporally extended actions |

REINFORCEMENT LEARNING FOR ROBOTICS

Related Terms

The Markov Decision Process is the foundational mathematical model for sequential decision-making. These related concepts define the algorithms, extensions, and practical challenges of applying MDPs to train autonomous systems.

Partially Observable MDP (POMDP)

A Partially Observable Markov Decision Process extends the MDP framework to model environments where an agent cannot directly perceive the true state. Instead, it receives observations that provide noisy or incomplete information. This is critical for robotics, where sensors (e.g., cameras, LiDAR) provide partial data.

Core Challenge: The agent must maintain a belief state—a probability distribution over possible true states—and act based on this belief.
Application: Essential for any real-world robot operating under sensor limitations or occlusion.
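
A discrete belief update follows Bayes' rule: predict the next-state distribution with the transition model, weight it by the likelihood of the received observation, then normalize. The sketch below assumes tabular models and a made-up two-state "listen" example; the dictionary layout is an illustrative choice.

```python
def update_belief(belief, action, observation, P, O):
    """Bayes filter step.

    belief: dict state -> probability
    P[(s, a)]: dict next_state -> P(s' | s, a)
    O[(s_next, a)]: dict observation -> P(o | s', a)
    """
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        # Prediction: probability of landing in s2 before seeing the observation.
        predicted = sum(P[(s, action)].get(s2, 0.0) * belief[s] for s in states)
        # Correction: weight by how likely the observation is in s2.
        unnormalized[s2] = O[(s2, action)].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical model: an object is on the left or right; listening does not move it,
# and the sensor reports the correct side 85% of the time.
P = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {
    ("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}

b0 = {"left": 0.5, "right": 0.5}
b1 = update_belief(b0, "listen", "hear_left", P, O)
b2 = update_belief(b1, "listen", "hear_left", P, O)
print(b1)  # belief shifts toward "left"
print(b2)  # a second consistent observation sharpens it further
```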

Bellman Equation

The Bellman equation provides the recursive, mathematical foundation for solving MDPs. It decomposes the value function—the expected cumulative reward from a state—into the immediate reward plus the discounted value of the successor state.

Dynamic Programming Core: This recursion enables efficient computation of optimal value functions and policies via algorithms like Value Iteration and Policy Iteration.
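Formulation: For a state s, the optimal value is V*(s) = max_a Σ_s' P(s'|s,a) [ R(s,a,s') + γ V*(s') ], where γ is the discount factor. When the reward does not depend on s', this reduces to V*(s) = max_a [ R(s,a) + γ Σ_s' P(s'|s,a) V*(s') ].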
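
Value Iteration applies the Bellman optimality backup repeatedly until the values stop changing. The sketch below uses a made-up two-state MDP with one stochastic transition; the model and helper names are assumptions.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def backup(V, s, a):
    """One-step lookahead: expected reward plus discounted successor value."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-10):
    """Return (V*, greedy policy) for the tabular model P."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The optimal policy acts greedily with respect to V*.
    policy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
    return V, policy

V, policy = value_iteration()
print(V)       # V*(1) = 2 / (1 - 0.9) = 20, V*(0) = 15.2 / 0.82
print(policy)  # {0: 'go', 1: 'stay'}
```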

Model-Based vs. Model-Free RL

This distinction defines how an RL agent interacts with the MDP's dynamics.

Model-Based RL: The agent learns or is given an explicit model of the transition probabilities P(s'|s,a) and reward function R(s,a). It uses this model for planning (e.g., via tree search) to improve sample efficiency. Examples include Dyna and Model Predictive Control (MPC).
Model-Free RL: The agent learns a policy or value function directly from experience without explicitly modeling dynamics. Examples include Q-Learning and Policy Gradient methods. The typical trade-off: model-based methods tend to need fewer environment interactions but can be limited by model errors, while model-free methods are simpler to implement and often reach strong final performance at the cost of more data.
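
As a model-free illustration, the tabular Q-Learning update uses only an observed transition (s, a, r, s') and never consults P. The function signature, learning rate, and state names below are assumptions for this sketch.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, next_actions,
                      alpha=0.1, gamma=0.9, done=False):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    if done or not next_actions:
        best_next = 0.0  # no future value after a terminal transition
    else:
        best_next = max(Q[(s_next, a2)] for a2 in next_actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)  # unseen state-action pairs start at 0
q_learning_update(Q, "s0", "go", 1.0, "s1", ["go", "stay"])
print(Q[("s0", "go")])  # 0 + 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

A model-based agent would instead fit P and R from the same transitions and plan with them, for example with value iteration.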

Exploration-Exploitation Tradeoff

The fundamental dilemma an agent faces in an MDP: whether to explore new actions to gather information about the environment or exploit known actions that yield high reward.

Balancing Mechanisms: Algorithms use specific strategies to manage this trade-off.
- ε-greedy: Choose a random action with probability ε, otherwise the best-known action.
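- Upper Confidence Bound (UCB): Choose the action with the highest optimistic estimate, i.e., its value estimate plus a bonus that grows with uncertainty.
- Thompson Sampling: Maintain a posterior distribution over value estimates, draw one sample from it, and take the action that is best under that sample.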
Critical in Robotics: Poor exploration can lead to policy failure, while excessive exploration can be unsafe or inefficient.
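
The three strategies can be sketched for a simple bandit-style setting with a fixed list of actions. The function names, the exploration constant `c`, and the Beta-Bernoulli model used for Thompson Sampling are assumptions chosen to keep the example short.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def ucb(q_values, counts, t, c=2.0):
    """Pick the action with the highest value estimate plus uncertainty bonus."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try every action once before comparing bonuses
    return max(range(len(q_values)),
               key=lambda i: q_values[i] + c * math.sqrt(math.log(t) / counts[i]))

def thompson_bernoulli(successes, failures, rng):
    """Sample each action's success rate from a Beta posterior and act greedily."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

rng = random.Random(0)
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=rng))
print(ucb([0.5, 0.4], counts=[100, 1], t=101))  # the rarely tried action gets a large bonus
print(thompson_bernoulli([8, 2], [2, 8], rng))
```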

Hierarchical Reinforcement Learning (HRL)

A framework for solving complex, long-horizon MDPs by introducing temporal abstraction. The problem is decomposed into a hierarchy of subtasks or skills.

Options Framework: A formalization where an option is a temporally extended action consisting of a policy, termination condition, and initiation set.
Benefits: Enables transfer learning, improves exploration in sparse-reward settings, and makes planning over long time horizons computationally tractable.
Robotics Use Case: A mobile manipulator might have high-level options like navigate_to(table) and low-level motor controls, allowing it to efficiently learn complex tasks like "set the table."
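
The three parts of an option map naturally onto a small data structure. In the sketch below the termination condition is a deterministic predicate rather than a probability, and the 1-D corridor, `TABLE_POS`, and helper names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable

@dataclass
class Option:
    name: str
    can_start: Callable[[State], bool]    # initiation set
    policy: Callable[[State], str]        # intra-option policy
    should_stop: Callable[[State], bool]  # termination condition (deterministic here)

def run_option(option, state, step_fn, max_steps=100):
    """Execute the option's policy until it terminates; return (state, steps taken)."""
    if not option.can_start(state):
        raise ValueError(f"option {option.name} cannot start in state {state!r}")
    steps = 0
    while not option.should_stop(state) and steps < max_steps:
        state = step_fn(state, option.policy(state))
        steps += 1
    return state, steps

# Hypothetical 1-D corridor: the state is an integer position, the table is at cell 5.
TABLE_POS = 5

def corridor_step(state, action):
    return state + 1 if action == "right" else state - 1

navigate_to_table = Option(
    name="navigate_to(table)",
    can_start=lambda s: s < TABLE_POS,
    policy=lambda s: "right",
    should_stop=lambda s: s == TABLE_POS,
)

print(run_option(navigate_to_table, 0, corridor_step))  # (5, 5)
```

A high-level policy can then choose among options as if they were single actions, which is the temporal abstraction described above.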

Safe Reinforcement Learning

An area of RL focused on ensuring that an agent's learning process and resulting policy satisfy safety constraints, which are often not captured by the standard MDP reward function.

Constrained MDPs (CMDPs): A common formal extension that includes cost functions the agent must keep below a threshold.
Key Techniques:
- Constrained Policy Optimization: Modify policy updates to respect cost limits.
- Risk-Sensitive Criteria: Optimize for conditional value-at-risk (CVaR) instead of expected return.
- Shielding: Use a verified safety filter to override unsafe actions proposed by the learning agent.
Imperative for Physical Systems: Prevents robots from causing damage to themselves, their environment, or humans during training and deployment.
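
Shielding can be sketched as a wrapper that checks each proposed action and substitutes a fallback when the check fails. The safety rule below (a braking-distance test with assumed `DT` and `MARGIN` values) is a hand-written placeholder, not a formally verified filter.

```python
DT = 0.5      # assumed control period in seconds
MARGIN = 0.1  # assumed safety margin in meters

def is_safe(distance_to_obstacle, velocity_cmd):
    """Hypothetical check: the robot must not close the gap to the obstacle in one step."""
    return velocity_cmd * DT < distance_to_obstacle - MARGIN

def shield(state, proposed_action, safety_check, fallback_action):
    """Pass the learner's action through unless the safety check rejects it."""
    if safety_check(state, proposed_action):
        return proposed_action
    return fallback_action

print(shield(5.0, 1.0, is_safe, fallback_action=0.0))  # far from the obstacle: action kept
print(shield(0.2, 1.0, is_safe, fallback_action=0.0))  # too close: overridden with a stop
```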


Prasad Kumkar
