A policy is a function, often denoted as π(s) or π(a|s), that maps from an observed state of the environment to an action (or a probability distribution over actions) to be taken by an agent. In deterministic policies, the mapping is direct; in stochastic policies, it defines a probability for each possible action. This function is the agent's strategy, defining its behavior for every situation it might encounter. The ultimate goal in reinforcement learning is to learn an optimal policy, π*, that maximizes the expected cumulative reward over time.
