Inferensys

Glossary

MDP (Markov Decision Process)

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making under uncertainty, defined by states, actions, transition probabilities, and rewards.
Governance lead reviewing model governance framework on laptop, policy documents visible, executive office setup.
AUTOMATED PLANNING SYSTEMS

What is MDP (Markov Decision Process)?

A Markov Decision Process (MDP) is the foundational mathematical framework for modeling sequential decision-making under uncertainty, forming the theoretical backbone of reinforcement learning and automated planning.

A Markov Decision Process (MDP) is a discrete-time stochastic control process that provides a formal model for decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is defined by a tuple (S, A, P, R, γ), where S is a set of states, A is a set of actions, P defines transition probabilities between states, R is a reward function, and γ is a discount factor. The core Markov property ensures the future state depends only on the current state and action, not the full history.

The objective in an MDP is to find an optimal policy—a mapping from states to actions—that maximizes the expected cumulative discounted reward. Algorithms like value iteration and policy iteration solve MDPs by leveraging the Bellman optimality equations. MDPs are extended to Partially Observable MDPs (POMDPs) for environments with hidden states and form the basis for model-based reinforcement learning, where an agent learns the transition and reward dynamics to plan sequences of actions.

MATHEMATICAL FORMALISM

Core Components of an MDP

A Markov Decision Process (MDP) provides a rigorous mathematical framework for modeling sequential decision-making under uncertainty. It is defined by five core components that together specify the dynamics of the environment and the agent's objective.

01

State Space (S)

The state space is the set of all possible configurations or situations the environment can be in. Each state s ∈ S is a complete description of the world at a specific time, sufficient to determine future dynamics. For example, in a robot navigation task, a state could be its precise (x, y) coordinates on a grid. The Markov property assumes the future state depends only on the current state and action, not the full history.

02

Action Space (A)

The action space is the set of all primitive decisions or controls available to the agent. From any given state s, the agent can choose an action a ∈ A(s), where A(s) denotes the actions applicable in state s. Actions can be discrete (e.g., {up, down, left, right}) or continuous (e.g., a torque value). The size and structure of the action space directly impact the complexity of finding an optimal policy.

03

Transition Function (P)

The transition function, P(s' | s, a), defines the dynamics of the environment. It is a probability distribution over next states s' given the current state s and action a. This function encodes the inherent uncertainty in the world. For a deterministic environment, P(s' | s, a) = 1 for a single outcome. In stochastic environments, it represents a probability mass or density function, such as a robot's movement succeeding with 90% probability or slipping with 10%.

04

Reward Function (R)

The reward function provides the agent's objective, defining the immediate desirability of transitioning from state s to state s' via action a. It is typically expressed as R(s, a, s') or the expected reward R(s, a). Rewards are scalar signals that the agent seeks to maximize over time. For instance, reaching a goal might yield +10, falling into a pit -10, and every other step -1 to encourage efficiency. The reward function is the sole specification of the task's goal.

05

Discount Factor (γ)

The discount factor, γ ∈ [0, 1], is a hyperparameter that determines the present value of future rewards. It is used to compute the return, which is the sum of discounted future rewards: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + .... A γ close to 1 makes the agent far-sighted, heavily weighing long-term outcomes. A γ close to 0 makes it myopic, focusing on immediate rewards. It also ensures the return is finite in infinite-horizon problems.

06

Policy (π) & Value Functions

While not a defining component of the MDP itself, the solution is a policy π(a|s)—a strategy mapping states to actions (or distributions over actions). From the MDP components, we derive value functions that evaluate policies:

  • State-Value Function V^π(s): Expected return starting from state s and following π.
  • Action-Value Function Q^π(s, a): Expected return after taking action a in state s, then following π. The core objective of solving an MDP is to find an optimal policy π* that maximizes these value functions.
MECHANISM

How Does an MDP Work?

A Markov Decision Process (MDP) is the foundational mathematical framework for modeling sequential decision-making under uncertainty, forming the core of reinforcement learning and automated planning systems.

An MDP is formally defined by the tuple (S, A, P, R, γ). S is a finite set of states. A is a finite set of actions available to the agent. P is the state transition probability function, where P(s'|s,a) defines the probability of moving to state s' from state s after taking action a. R is the reward function, where R(s,a,s') specifies the immediate scalar feedback. γ (gamma) is the discount factor, a value between 0 and 1 that determines the present value of future rewards.

The agent's objective is to find a policy π(a|s)—a mapping from states to actions—that maximizes the expected cumulative discounted reward, known as the return. This optimization is governed by the Bellman equations, which recursively define the value of a state or state-action pair. Solving an MDP involves computing this optimal policy, typically through dynamic programming methods like value iteration or policy iteration, which iteratively refine value estimates until convergence.

MARKOV DECISION PROCESS

Frequently Asked Questions

A Markov Decision Process (MDP) is the foundational mathematical framework for modeling sequential decision-making under uncertainty, central to reinforcement learning and automated planning. These questions address its core components, applications, and relationship to other planning paradigms.

A Markov Decision Process (MDP) is a formal, discrete-time stochastic control process that provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker (agent). It is defined by the tuple (S, A, P, R, γ), where:

  • S is a finite set of states.
  • A is a finite set of actions.
  • P(s' | s, a) is the state transition probability, the probability of transitioning to state s' from state s after taking action a.
  • R(s, a, s') is the reward function, the immediate scalar reward received after transitioning from s to s' via action a.
  • γ (gamma) is the discount factor, a number between 0 and 1 that determines the present value of future rewards.

The core property is the Markov property: the future state and reward depend only on the current state and action, not on the entire history. This 'memoryless' property is what makes the problem tractable for computation.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.