Inferensys

Policy (in RL/Planning)

A policy is a strategy or mapping that defines which action an agent should take in a given state (or belief state) to maximize its expected cumulative reward.
DEFINITION

What is Policy (in RL/Planning)?

A policy is the core decision-making rule in reinforcement learning and automated planning, dictating an agent's behavior.

In reinforcement learning (RL) and automated planning, a policy is a mapping from states (or belief states) to actions that defines an agent's strategy for selecting actions to maximize its expected cumulative reward. In Markov Decision Processes (MDPs), a deterministic policy is a function π: S → A, written π(s) = a, while in Partially Observable MDPs (POMDPs), the policy maps belief states to actions. Policies can be deterministic, specifying a single action per state, or stochastic, defining a probability distribution π(a|s) over actions.

The goal of RL algorithms like policy gradient methods is to learn an optimal policy π* through interaction. In planning, a policy can be the output of a contingent planner, forming a conditional plan tree. Key related concepts include the value function, which evaluates a policy's expected return, and Bellman optimality equations, which define the conditions for an optimal policy. Model-based RL often uses a learned world model to improve policy search efficiency.
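
For reference, the Bellman optimality equations mentioned above are commonly written as follows. The symbols P (transition probabilities), R (reward function), and γ (discount factor in [0, 1)) are notation assumed here for illustration; they are not defined elsewhere in this entry.

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{*}(s') \bigr]

Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \bigr]

\pi^{*}(s) \in \arg\max_{a} Q^{*}(s, a)
```

Any policy that picks a maximizing action in every state, as in the last line, is optimal under these assumptions.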

AUTOMATED PLANNING SYSTEMS

Key Characteristics of a Policy

In automated planning and reinforcement learning, a policy is the core decision-making rule that maps an agent's perceived situation to its next action. Its properties define the agent's behavior, efficiency, and adaptability.

01

Mapping from State to Action

A policy is fundamentally a mapping function π: S → A, written π(s) = a in the deterministic case. It defines the agent's strategy by specifying which action a to take from the action space A when in a given state s from the state space S. This mapping can be:

  • Deterministic: A single, fixed action for each state.
  • Stochastic: A probability distribution over possible actions for a state, enabling exploration.

In Partially Observable MDPs (POMDPs), the policy maps from a belief state (a probability distribution over possible true states) to an action.

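
The two kinds of mapping can be illustrated with a minimal Python sketch. The state names, action names, and probabilities below are hypothetical values chosen only for the example.

```python
import random

# Hypothetical two-state, two-action problem (illustration only).
ACTIONS = ["recharge", "explore"]

# Deterministic policy: exactly one action per state.
deterministic_policy = {"low_battery": "recharge", "high_battery": "explore"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "high_battery": {"recharge": 0.2, "explore": 0.8},
}


def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action from the policy's distribution for this state."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs sample the same actions
print(act_deterministic("low_battery"))
print(act_stochastic("low_battery", rng))
```

The deterministic lookup always returns the same action, while repeated calls to `act_stochastic` return each action with roughly the listed frequency.
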
02

Optimization for Cumulative Reward

The policy's objective is to maximize the agent's expected cumulative reward, often a discounted sum of future rewards in which a reward received t steps ahead is weighted by γ^t for a discount factor γ between 0 and 1. It is not designed to optimize for a single immediate reward but for long-term success. This optimization is formalized by the Bellman equation, which defines the value of following a policy from a given state. In reinforcement learning, policy gradient methods directly parameterize and optimize the policy function itself to improve this expected return.
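
As a rough illustration of how the Bellman equation assigns a value to a fixed policy, the sketch below runs iterative policy evaluation on a tiny MDP. The two states, the transition table `P`, the rewards, and the discount `GAMMA` are all made-up values for the example.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[state][action] = list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

# The fixed policy being evaluated.
policy = {0: "go", 1: "stay"}


def evaluate_policy(P, policy, gamma, tol=1e-10):
    """Apply the Bellman expectation backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


V = evaluate_policy(P, policy, GAMMA)
print(V)
```

Under this policy, state 1 earns 2 per step forever, so its value converges to 2 / (1 - 0.9) = 20, and state 0 converges to 1 + 0.9 × 20 = 19. The immediate reward in state 0 is small, but the long-term value is high, which is the distinction the paragraph above draws.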

03

Representation and Parameterization

Policies can be represented in various forms, impacting their expressiveness and learnability:

  • Tabular: A simple lookup table for discrete state and action spaces.
  • Parametric (e.g., Neural Network): A function approximator like a deep neural network that can generalize to unseen states, essential for large or continuous spaces. The weights of the network are the policy parameters.
  • Symbolic/Logic-based: A set of rules or conditions, common in classical planning domains.

The choice of representation directly affects the policy search process during learning.

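
To make the contrast between tabular and parametric forms concrete, here is a small sketch. A linear softmax policy stands in for the neural network case to keep the example short; the feature vector, the weights `theta`, and the action names are hypothetical.

```python
import math

ACTIONS = ["left", "right"]

# Tabular: one explicit entry per discrete state.
tabular_policy = {0: "left", 1: "right", 2: "right"}


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_softmax_policy(theta, features):
    """Parametric policy: one weight row per action, softmax over the scores."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in theta]
    return dict(zip(ACTIONS, softmax(scores)))


# theta holds the policy parameters; learning would adjust these values.
theta = [[1.0, -0.5], [-1.0, 0.5]]
probs = linear_softmax_policy(theta, [0.2, 0.7])
print(tabular_policy[1], probs)
```

The table can only answer for states it lists, whereas the parametric policy produces a distribution for any feature vector, including ones never seen during training.
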
04

Stationary vs. Non-Stationary

A stationary policy is one where the action choice depends only on the current state (or belief state), not on the time step. Most infinite-horizon MDP formulations assume stationary policies. A non-stationary policy can select different actions for the same state at different time steps, which can be necessary for finite-horizon problems or when the environment's dynamics change over time. In planning, a contingent plan (a plan tree) behaves similarly: the action at each node depends on the observations received so far, so the same underlying state can call for different actions at different points in execution.
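
The finite-horizon case can be sketched with backward induction on a toy deterministic problem. The states, actions, and rewards below are invented for the example; the point is only that the best action in state "A" changes with the number of steps remaining.

```python
# Hypothetical deterministic MDP: P[state][action] = (next_state, reward).
P = {
    "A": {"wait": ("A", 1.0), "invest": ("B", 0.0)},
    "B": {"collect": ("B", 3.0)},
}


def backward_induction(P, horizon):
    """Compute one decision rule per time step, working back from the end."""
    V = {s: 0.0 for s in P}  # value when no steps remain
    policy = [None] * horizon
    for t in reversed(range(horizon)):
        new_V, pi_t = {}, {}
        for s, actions in P.items():
            # Undiscounted return: immediate reward plus value of the next state.
            best = max(actions, key=lambda a: actions[a][1] + V[actions[a][0]])
            pi_t[s] = best
            new_V[s] = actions[best][1] + V[actions[best][0]]
        V, policy[t] = new_V, pi_t
    return policy, V


policy, V = backward_induction(P, horizon=3)
for t, rule in enumerate(policy):
    print(t, rule)
```

With three steps, the rule for state "A" is "invest" at steps 0 and 1 but "wait" at the final step, because investing only pays off if time remains to collect. No single stationary rule for "A" reproduces this behavior.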

05

Relationship to Plans and Value Functions

A policy is a more general concept than a simple plan (a linear sequence of actions). A plan corresponds to one possible rollout or trajectory generated by following a specific policy from a given start state. In the other direction, an optimal policy can be derived from an optimal value function (V*(s) or Q*(s,a)) by acting greedily with respect to it. In model-based RL and planning, policies are often evaluated and improved by simulating their outcomes using an internal world model.
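
Both relationships can be shown in a few lines. The Q-values and the deterministic transition table `T` below are hypothetical numbers, not the output of any particular algorithm.

```python
# Hypothetical action values Q[state][action].
Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.3},
}

# Hypothetical deterministic transitions: (state, action) -> next state.
T = {
    ("s0", "left"): "s0",
    ("s0", "right"): "s1",
    ("s1", "left"): "s0",
    ("s1", "right"): "s1",
}


def greedy_policy(Q):
    """Derive a policy by picking the highest-valued action in each state."""
    return {s: max(qs, key=qs.get) for s, qs in Q.items()}


def rollout(policy, T, start, steps):
    """Follow the policy from a start state; the action sequence is a plan."""
    state, plan = start, []
    for _ in range(steps):
        action = policy[state]
        plan.append(action)
        state = T[(state, action)]
    return plan


policy = greedy_policy(Q)
plan = rollout(policy, T, "s0", 3)
print(policy, plan)
```

The policy covers every state, while the plan is just the specific action sequence that results from starting in "s0". Starting elsewhere would yield a different plan from the same policy.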

06

Exploration vs. Exploitation Trade-off

A critical characteristic, especially during learning, is how the policy balances exploration (trying new actions to gather information) and exploitation (choosing the best-known action to maximize reward). Stochastic policies naturally facilitate exploration. Algorithms address this explicitly: ε-greedy policies exploit most of the time but explore randomly with probability ε, while policies optimized via the Upper Confidence Bound (UCB) principle, as used in Monte Carlo Tree Search (MCTS), quantify uncertainty to guide exploration intelligently.
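
The two selection rules named above can be sketched as follows. This is a bandit-style simplification with hypothetical value estimates and visit counts; the UCB1 form with exploration constant `c` is one common variant, and MCTS implementations differ in the details.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def ucb1(q_values, counts, c=math.sqrt(2)):
    """Pick the action with the highest estimate plus uncertainty bonus."""
    # Try every action once before the bonus formula applies.
    for action, n in counts.items():
        if n == 0:
            return action
    total = sum(counts.values())
    return max(
        q_values,
        key=lambda a: q_values[a] + c * math.sqrt(math.log(total) / counts[a]),
    )


q = {"a": 0.50, "b": 0.45}        # hypothetical value estimates
counts = {"a": 100, "b": 1}       # hypothetical visit counts
rng = random.Random(0)

print(epsilon_greedy(q, 0.1, rng))
print(ucb1(q, counts))
```

Here `ucb1` selects "b" even though its estimate is lower, because it has been tried only once and its uncertainty bonus is large. ε-greedy explores without regard to how often each action has been tried.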

POLICY (IN RL/PLANNING)

Frequently Asked Questions

A policy is the core decision-making component of an autonomous agent. These questions address its definition, implementation, and role within automated planning and reinforcement learning systems.

A policy is a complete mapping or strategy that defines which action an agent should execute in any given state (or belief state) to maximize its expected long-term cumulative reward. In Reinforcement Learning (RL), a policy is the function the agent learns, often denoted as π(a|s), representing the probability of taking action 'a' in state 's'. In automated planning, a policy can be a simple sequence of actions (a plan) or a more complex conditional strategy that specifies different actions based on runtime observations.

  • In RL: An optimal policy is the solution to the Markov Decision Process (MDP). A policy can be deterministic (π(s) = a) or stochastic (π(a|s) = probability of choosing a in s).
  • In Planning: For deterministic, fully observable environments, a policy can reduce to a linear plan. For partially observable environments (modeled as a POMDP), it is a mapping from belief states to actions.

The agent's ultimate performance is evaluated by the total reward obtained by following its policy.
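
The belief-state mapping used in the POMDP case can be sketched as a Bayes update followed by a simple threshold rule. Everything here is an assumption made for illustration: the hidden state is taken to be static (so no transition model appears), the observation model `OBS_MODEL` has a made-up 85% accuracy, and the threshold policy is hand-written rather than optimal.

```python
# Hypothetical observation model: OBS_MODEL[state][observation] = probability.
OBS_MODEL = {
    "left": {"hear_left": 0.85, "hear_right": 0.15},
    "right": {"hear_left": 0.15, "hear_right": 0.85},
}


def update_belief(belief, observation, obs_model):
    """Bayes update of the belief, assuming the hidden state does not change."""
    unnorm = {s: obs_model[s][observation] * p for s, p in belief.items()}
    total = sum(unnorm.values())
    return {s: v / total for s, v in unnorm.items()}


def belief_policy(belief, threshold=0.9):
    """Map a belief to an action: commit once one state is likely enough."""
    state, prob = max(belief.items(), key=lambda kv: kv[1])
    return f"act_on_{state}" if prob >= threshold else "listen"


belief = {"left": 0.5, "right": 0.5}
for obs in ["hear_left", "hear_left"]:
    print(belief_policy(belief), belief)
    belief = update_belief(belief, obs, OBS_MODEL)
print(belief_policy(belief), belief)
```

The policy never sees the true state. It keeps choosing "listen" until the accumulated observations push the belief past the threshold, then commits to an action.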

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.