Preference modeling is the machine learning task of training a model, typically a reward model, to predict a preference ranking between different outputs. It captures nuanced human or AI judgments about quality, safety, and alignment, forming the critical signal for fine-tuning techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). The model learns from datasets of paired comparisons where one output is labeled as preferred over another.
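The learning setup described above can be sketched with a toy example. The snippet below fits a small linear reward model to hand-made paired comparisons by minimizing the Bradley-Terry loss, -log σ(r(chosen) − r(rejected)), with plain gradient descent. The feature vectors, weights, and function names are illustrative assumptions, not a real dataset or library API.

```python
import math

# Toy preference dataset: (preferred, rejected) feature-vector pairs.
# The features are made up purely for illustration.
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.3, 0.7]),
    ([0.9, 0.3], [0.2, 0.8]),
]

def reward(w, x):
    """Linear reward model: r(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, lr=0.5, epochs=200):
    """Fit w by gradient descent on the Bradley-Terry loss
    -log sigmoid(r(chosen) - r(rejected)) over all pairs."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Gradient of -log sigmoid(margin) w.r.t. w is
            # -(1 - sigmoid(margin)) * (chosen - rejected); step against it.
            g = 1.0 - sigmoid(margin)
            for i in range(len(w)):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

w = train(pairs)
# After training, each preferred output should score higher than its rival.
assert all(reward(w, c) > reward(w, r) for c, r in pairs)
```

The same pairwise logistic objective underlies reward-model training in RLHF pipelines and appears, reparameterized in terms of policy log-probabilities, in the DPO loss; real systems simply replace the linear scorer with a neural network over model outputs.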
