Reward Modeling is the process of training a separate machine learning model, called a reward model, to predict a scalar reward signal that captures human preferences or desired behavior. The reward model is typically trained on human comparisons between pairs of candidate outputs, learning to assign higher scores to the preferred response. This learned reward function is then used to train or fine-tune a primary policy or language model via reinforcement learning, most commonly with the Proximal Policy Optimization (PPO) algorithm. The technique is foundational to Reinforcement Learning from Human Feedback (RLHF), enabling the alignment of powerful AI systems with complex, difficult-to-specify human values.
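The core training objective can be sketched with a toy example. The following is a minimal illustration, not a production implementation: it assumes responses are already encoded as small feature vectors and fits a linear reward model with the Bradley-Terry-style pairwise loss, -log σ(r(chosen) - r(rejected)), that is standard in RLHF reward modeling. All function names and data here are hypothetical.

```python
import math

def reward(w, x):
    # Scalar reward: dot product of weights and response features.
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    # pairs: list of (chosen_features, rejected_features) tuples from
    # human preference comparisons.
    # Minimizes -log sigmoid(r(chosen) - r(rejected)) by gradient descent,
    # pushing the model to score preferred responses higher.
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            grad_coeff = sigmoid(margin) - 1.0  # d(loss)/d(margin)
            for i in range(dim):
                w[i] -= lr * grad_coeff * (chosen[i] - rejected[i])
    return w

# Toy preference data: the preferred response always has a larger
# first feature, so the learned weight on that feature should be positive.
pairs = [([1.0, 0.2], [0.1, 0.3]), ([0.9, 0.5], [0.2, 0.4])]
w = train_reward_model(pairs, dim=2)
assert reward(w, [1.0, 0.2]) > reward(w, [0.1, 0.3])
```

In a real RLHF pipeline the linear model is replaced by a large neural network (often the language model itself with a scalar head), but the pairwise objective is the same; the trained reward model then supplies the scalar reward signal that PPO optimizes.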
