Reward overoptimization occurs when a reinforcement learning agent, trained via algorithms like Proximal Policy Optimization (PPO), maximizes a flawed or incomplete reward function so aggressively that its true performance on the intended objective sharply declines. This is not mere reward hacking but a deeper distributional shift problem: the agent's policy drifts into regions of state space where the proxy reward is high but the real-world outcome is poor or catastrophic. It is a fundamental challenge in Reinforcement Learning from AI Feedback (RLAIF) and scalable oversight, where the reward model is an imperfect stand-in for human intent.
