In reinforcement learning (RL) and automated planning, a policy is a mapping from states (or belief states) to actions that defines an agent's strategy for selecting actions to maximize its expected cumulative reward. In Markov Decision Processes (MDPs), it is a function π(s) → a, while in Partially Observable MDPs (POMDPs), it maps belief states to actions. Policies can be deterministic, specifying a single action per state, or stochastic, defining a probability distribution over actions.
Policy (in RL/Planning)

What is a Policy (in RL/Planning)?
A policy is the core decision-making rule in reinforcement learning and automated planning, dictating an agent's behavior.
The goal of RL algorithms like policy gradient methods is to learn an optimal policy π* through interaction. In planning, a policy can be the output of a contingent planner, forming a conditional plan tree. Key related concepts include the value function, which evaluates a policy's expected return, and Bellman optimality equations, which define the conditions for an optimal policy. Model-based RL often uses a learned world model to improve policy search efficiency.
Key Characteristics of a Policy
In automated planning and reinforcement learning, a policy is the core decision-making rule that maps an agent's perceived situation to its next action. Its properties define the agent's behavior, efficiency, and adaptability.
Mapping from State to Action
A policy is fundamentally a mapping function, denoted as π(s) → a. It defines the agent's strategy by specifying which action a to take from the action space when in a given state s from the state space. This mapping can be:
- Deterministic: A single, fixed action for each state.
- Stochastic: A probability distribution over possible actions for a state, enabling exploration.
In Partially Observable MDPs (POMDPs), the policy maps from a belief state (a probability distribution over possible true states) to an action.
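The two kinds of mapping can be sketched with plain dictionaries. This is a minimal illustration, assuming a hypothetical two-state, two-action problem; the state names, action names, and probabilities are invented for the example.

```python
import random

# Hypothetical state and action spaces, used only for illustration.
STATES = ["low_battery", "high_battery"]
ACTIONS = ["recharge", "explore"]

# Deterministic policy: exactly one action per state.
deterministic_policy = {"low_battery": "recharge", "high_battery": "explore"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "high_battery": {"recharge": 0.2, "explore": 0.8},
}


def act_deterministic(state):
    """Look up the single action assigned to this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action according to the distribution for this state."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `act_deterministic("low_battery")` always returns `"recharge"`, while repeated calls to `act_stochastic("low_battery")` return `"explore"` roughly one time in ten.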
Optimization for Cumulative Reward
The policy's objective is to maximize the agent's expected cumulative reward, often a discounted sum of future rewards. It is designed to optimize long-term success rather than any single immediate reward. The Bellman equation formalizes this objective by recursively defining the value of following a policy from a given state. In reinforcement learning, policy gradient methods directly parameterize the policy function and optimize it to improve this expected return.
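To make the contrast between immediate and cumulative reward concrete, the following minimal sketch computes a discounted return. The function name `discounted_return` and the two reward sequences are illustrative assumptions, not part of any particular library.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequences for two behaviours over three steps.
myopic = [5.0, 0.0, 0.0]    # large immediate reward, nothing afterwards
patient = [0.0, 0.0, 10.0]  # no immediate reward, larger payoff later

print(discounted_return(myopic, gamma=0.9))   # 5.0
print(discounted_return(patient, gamma=0.9))  # roughly 8.1
```

With a discount factor of 0.9 the delayed payoff is still worth more than the immediate one, so a policy that maximizes cumulative reward would prefer the second behaviour.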
Representation and Parameterization
Policies can be represented in various forms, impacting their expressiveness and learnability:
- Tabular: A simple lookup table for discrete state and action spaces.
- Parametric (e.g., Neural Network): A function approximator like a deep neural network that can generalize to unseen states, essential for large or continuous spaces. The weights of the network are the policy parameters.
- Symbolic/Logic-based: A set of rules or conditions, common in classical planning domains.
The choice of representation directly affects the policy search process during learning.
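As a rough sketch of a parametric representation, the snippet below implements a linear softmax policy in plain Python instead of a neural network. The function names, the feature vector, and the weights are assumptions chosen for illustration; a deep network would replace the linear scoring step but keep the same interface of parameters in, action probabilities out.

```python
import math


def softmax(logits):
    """Convert arbitrary real scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_softmax_policy(theta, features):
    """Action probabilities for one state.

    theta: one weight vector per action (the policy parameters).
    features: numeric feature vector describing the state.
    """
    logits = [sum(w * x for w, x in zip(weights, features)) for weights in theta]
    return softmax(logits)


# Hypothetical parameters for two actions over two state features.
theta = [[1.0, 0.0], [0.0, 1.0]]
probs = linear_softmax_policy(theta, features=[2.0, 0.0])
```

Because the policy is a function of features and not a table indexed by state, it produces a distribution for any feature vector, including states never seen during training.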
Stationary vs. Non-Stationary
A stationary policy is one where the action choice depends only on the current state (or belief state), not on the time step. Most infinite-horizon MDP formulations assume stationary policies. A non-stationary policy can select different actions for the same state at different time steps, which can be necessary for finite-horizon problems or when the environment's dynamics change over time. In planning, a contingent policy (a plan tree) is history-dependent: the action at each node depends on the observations received so far, so the same underlying state can call for different actions at different points in execution.
Relationship to Plans and Value Functions
A policy is a more general concept than a simple plan (a linear sequence of actions). A plan is one possible rollout or trajectory generated by following a specific policy. Conversely, an optimal policy can be derived from an optimal value function (V*(s) or Q*(s,a)) by acting greedily with respect to it. In model-based RL and planning, policies are often evaluated and improved by simulating their outcomes using an internal world model.
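Acting greedily with respect to an action-value function can be sketched in a few lines. The Q-table below is a made-up example and `greedy_policy` is a hypothetical helper name; the values are assumed to already be optimal.

```python
def greedy_policy(q_table):
    """For each state, pick the action with the highest Q-value."""
    return {state: max(q_values, key=q_values.get)
            for state, q_values in q_table.items()}


# Hypothetical Q*(s, a) values for a two-state problem.
q_star = {
    "s0": {"left": 0.2, "right": 1.0},
    "s1": {"left": 0.7, "right": 0.1},
}

policy = greedy_policy(q_star)
print(policy)  # {'s0': 'right', 's1': 'left'}
```

The resulting policy is deterministic. If the Q-values are only estimates, the greedy policy is optimal with respect to those estimates and not necessarily with respect to the true environment.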
Exploration vs. Exploitation Trade-off
A critical characteristic, especially during learning, is how the policy balances exploration (trying new actions to gather information) and exploitation (choosing the best-known action to maximize reward). Stochastic policies naturally facilitate exploration. Algorithms address this explicitly: ε-greedy policies exploit most of the time but explore randomly with probability ε, while policies optimized via the Upper Confidence Bound (UCB) principle, as used in Monte Carlo Tree Search (MCTS), quantify uncertainty to guide exploration intelligently.
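Both selection rules can be sketched for a single decision point. This is an illustrative sketch under stated assumptions: `epsilon_greedy` and `ucb1` are hypothetical helper names, the value estimates and visit counts are invented, and the exploration constant `c` is an arbitrary choice. The UCB rule shown is the common UCB1-style bonus; MCTS implementations vary in the exact form they use.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def ucb1(q_values, counts, c=1.4):
    """Pick the action maximizing estimated value plus an uncertainty bonus.

    counts[a] is how often action a has been tried; untried actions go first.
    """
    for action, n in counts.items():
        if n == 0:
            return action
    total = sum(counts.values())
    return max(
        q_values,
        key=lambda a: q_values[a] + c * math.sqrt(math.log(total) / counts[a]),
    )


# Hypothetical estimates: "a" looks slightly better but "b" has barely been tried.
q = {"a": 0.5, "b": 0.4}
n = {"a": 100, "b": 1}
print(epsilon_greedy(q, epsilon=0.0))  # a
print(ucb1(q, n))                      # b
```

The ε-greedy rule explores blindly, whereas the UCB-style rule directs exploration toward actions whose value estimates are still uncertain because they have been sampled rarely.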
Frequently Asked Questions
A policy is the core decision-making component of an autonomous agent. These questions address its definition, implementation, and role within automated planning and reinforcement learning systems.
A policy is a complete mapping or strategy that defines which action an agent should execute in any given state (or belief state) to maximize its expected long-term cumulative reward. In Reinforcement Learning (RL), a policy is the function the agent learns, often denoted as π(a|s), representing the probability of taking action 'a' in state 's'. In automated planning, a policy can be a simple sequence of actions (a plan) or a more complex conditional strategy that specifies different actions based on runtime observations.
- In RL: The policy is the solution to the Markov Decision Process (MDP). It can be deterministic (π(s) = a) or stochastic (π(a|s) = probability).
- In Planning: For deterministic, fully observable environments, a policy reduces to a linear plan. For partially observable environments (modeled as a POMDP), it is a mapping from belief states to actions.
The agent's ultimate performance is evaluated by the total reward obtained by following its policy.
Related Terms
A policy is the core decision-making component of an autonomous agent. Understanding its related concepts is essential for designing robust planning and reinforcement learning systems.
Markov Decision Process (MDP)
An MDP is the foundational mathematical framework for modeling sequential decision-making where outcomes are partly random and partly under the control of a decision-maker. It is defined by a tuple (S, A, P, R, γ):
- S: A finite set of states.
- A: A finite set of actions.
- P: Transition probability function, P(s'|s, a).
- R: Reward function, R(s, a, s').
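- γ: Discount factor, 0 ≤ γ ≤ 1. For infinite-horizon problems, γ < 1 is typically required so that the discounted return stays finite.
A policy in an MDP is a mapping from states to actions, or to distributions over actions, written π(a|s). The goal is to find an optimal policy π* that maximizes the expected cumulative discounted reward.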
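One standard way to compute an optimal policy for a small MDP is value iteration. The sketch below encodes a made-up two-state MDP, with P and R folded into one table, and assumes γ < 1 so that the iteration converges. All names (`P`, `GAMMA`, `value_iteration`, `extract_policy`) and all numbers are illustrative assumptions.

```python
GAMMA = 0.9  # discount factor, assumed strictly below 1

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 1.0)], "go": [(1.0, "A", 0.0)]},
}


def q_value(V, s, a):
    """Expected reward plus discounted next-state value for taking a in s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality backup until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


def extract_policy(V):
    """Greedy (deterministic) policy with respect to the value function V."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


V_star = value_iteration()
pi_star = extract_policy(V_star)
print(pi_star)  # {'A': 'go', 'B': 'stay'}
```

In this toy problem the only reward comes from staying in state B, so the optimal policy moves from A to B and then stays there.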
Partially Observable MDP (POMDP)
A POMDP extends the MDP framework to environments where the agent cannot directly observe the true underlying state. It introduces:
- Ω: A set of observations.
- O: Observation probability function, O(o|s', a).
Because the state is hidden, the agent maintains a belief state, a probability distribution over possible states. A policy in a POMDP is therefore a mapping from belief states to actions, π(a|b). Solving a POMDP involves finding a policy that maximizes expected reward while reasoning under uncertainty, making it critical for real-world applications like robotics and dialogue systems.
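The belief state is typically maintained with a Bayes filter: after taking action a and observing o, the new belief is b'(s') ∝ O(o|s', a) Σ_s P(s'|s, a) b(s). The sketch below applies that update to a small made-up problem in which listening does not change the hidden state and the observation is correct 85% of the time. The table layout, names, and numbers are assumptions for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes-filter update of a belief state.

    belief: {state: probability}
    T[s][a]: {next_state: probability}   (transition model P(s'|s, a))
    O[s2][a]: {observation: probability} (observation model O(o|s', a))
    """
    unnormalized = {}
    for s2 in belief:
        predicted = sum(T[s][action].get(s2, 0.0) * belief[s] for s in belief)
        unnormalized[s2] = O[s2][action].get(observation, 0.0) * predicted
    norm = sum(unnormalized.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / norm for s2, p in unnormalized.items()}


# Hypothetical model: the hidden state is "left" or "right"; listening keeps it fixed.
T = {"left": {"listen": {"left": 1.0}}, "right": {"listen": {"right": 1.0}}}
O = {
    "left": {"listen": {"hear_left": 0.85, "hear_right": 0.15}},
    "right": {"listen": {"hear_left": 0.15, "hear_right": 0.85}},
}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear_left", T, O)
```

Starting from a uniform belief, one `hear_left` observation shifts the belief to about 0.85 for `left`. A POMDP policy then chooses its next action from `b1`, not from the unobservable true state.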
Value Function
A value function estimates the long-term desirability of being in a state (or of taking an action from a state) under a given policy. The two primary types are:
- State-Value Function Vπ(s): The expected return (cumulative reward) starting from state s and following policy π thereafter.
- Action-Value Function Qπ(s, a): The expected return starting from state s, taking action a, and thereafter following policy π.
The Bellman equation provides a recursive definition for these functions. The optimal policy π* is intimately linked to the optimal value functions V* and Q*; it is defined by selecting actions that maximize Q*(s, a).
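Written out in the MDP notation P(s'|s, a), R(s, a, s'), and γ, the Bellman equations take the following form. This is the standard textbook formulation for discrete state and action spaces, included here as a reference sketch; continuous spaces replace the sums with integrals.

```latex
\begin{align}
V^{\pi}(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,
  \bigl[ R(s, a, s') + \gamma V^{\pi}(s') \bigr] \\
Q^{\pi}(s, a) &= \sum_{s'} P(s' \mid s, a)\,
  \bigl[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \bigr] \\
V^{*}(s) &= \max_{a} \sum_{s'} P(s' \mid s, a)\,
  \bigl[ R(s, a, s') + \gamma V^{*}(s') \bigr] \\
\pi^{*}(s) &\in \arg\max_{a} Q^{*}(s, a)
\end{align}
```

The first two equations evaluate a fixed policy π, the third is the Bellman optimality equation, and the last line states how an optimal policy is read off from Q*.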
Policy Gradient Methods
Policy gradient methods are a class of reinforcement learning algorithms that directly optimize the parameters θ of a parameterized policy π_θ(a|s). Instead of learning value functions and deriving a policy (as in Q-learning), they adjust θ to maximize the expected reward J(θ) using gradient ascent.
- REINFORCE: A Monte Carlo policy gradient algorithm.
- Actor-Critic: Combines a policy (actor) with a value function (critic) to reduce variance in gradient estimates.
These methods are particularly effective for high-dimensional or continuous action spaces and are the foundation for many advanced algorithms like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).
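The core REINFORCE update, θ ← θ + α · G · ∇θ log π_θ(a), can be illustrated on the simplest possible case: a single-state problem (a two-armed bandit) with a softmax policy over two preference parameters. This is a deliberately reduced sketch, not a full episodic implementation; the reward values, learning rate, step count, and function names are assumptions, and no baseline is used.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=1000, lr=0.5, seed=0):
    """Run REINFORCE on a one-state problem with a deterministic reward per action."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters: one preference per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        G = rewards[action]  # return of this one-step episode
        for i in range(len(theta)):
            # Gradient of log softmax: indicator(i == action) - probs[i]
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * G * grad_log
    return softmax(theta)


# Hypothetical bandit: action 1 pays 1.0, action 0 pays nothing.
final_probs = reinforce_bandit(rewards=[0.0, 1.0])
```

After training, `final_probs` places most of its mass on action 1. In multi-step problems G would be the discounted return following each action, and an actor-critic method would subtract a learned value estimate from G to reduce variance.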
Deterministic vs. Stochastic Policy
Policies are categorized by how they select actions:
- Deterministic Policy: Maps a state (or belief state) directly to a single action, a = μ(s). This is simple and memory-efficient but may not explore the environment sufficiently during learning and can be brittle in stochastic environments.
- Stochastic Policy: Specifies a probability distribution over actions for a given state, π(a|s). This is essential for exploration during training and is often necessary for optimal behavior in adversarial or partially observable settings (e.g., mixed strategies in game theory). Most policy gradient methods learn stochastic policies, while algorithms like Deep Deterministic Policy Gradient (DDPG) learn deterministic ones.
Stationary vs. Non-Stationary Policy
This distinction is crucial in finite-horizon and time-dependent planning problems.
- Stationary Policy: The action choice depends only on the current state (or belief state), not on the time step. π_t(a|s) = π(a|s) for all t. Most infinite-horizon MDPs have optimal policies that are stationary.
- Non-Stationary Policy: The action choice can vary with both the state and the time step, π_t(a|s). These are common in finite-horizon problems where the time remaining influences the optimal decision (e.g., aggressive maneuvers are less desirable near a deadline). Contingent plans in temporal planning often result in non-stationary policies.
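A finite-horizon problem solved by backward induction shows how the best action in one state can change with the time step. The toy model below is an assumption made up for illustration: in state A the agent can take a small reward now (`safe`) or move to state B (`invest`), where a larger reward is collected on each later step. Rewards are undiscounted and transitions are deterministic to keep the sketch short.

```python
# Hypothetical deterministic model: MODEL[state][action] = (next_state, reward).
MODEL = {
    "A": {"safe": ("A", 1.0), "invest": ("B", 0.0)},
    "B": {"collect": ("B", 3.0)},
}


def backward_induction(horizon):
    """Compute an optimal time-indexed policy for a fixed number of steps."""
    V = {s: 0.0 for s in MODEL}  # value when no steps remain
    policy = [None] * horizon
    for t in reversed(range(horizon)):
        q = {
            s: {a: r + V[s2] for a, (s2, r) in MODEL[s].items()}
            for s in MODEL
        }
        policy[t] = {s: max(q[s], key=q[s].get) for s in MODEL}
        V = {s: max(q[s].values()) for s in MODEL}
    return policy, V


policy, V0 = backward_induction(horizon=3)
print([step["A"] for step in policy])  # ['invest', 'invest', 'safe']
```

In state A the optimal action is `invest` while at least two steps remain and `safe` on the final step, so no single stationary rule for A is optimal across the whole horizon.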

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.