Policy gradient methods are a class of reinforcement learning (RL) algorithms that directly optimize the parameters of a policy function—which maps states to actions—by estimating the gradient of expected cumulative reward with respect to those parameters. Unlike value-based methods like Q-learning, which learn a value function and derive a policy indirectly, policy gradients adjust the policy in the direction that increases the probability of high-reward actions. This direct approach is particularly well-suited for continuous action spaces and complex, stochastic policies, forming the foundation for advanced algorithms like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).




