Direct Preference Optimization (DPO) is a machine learning algorithm that fine-tunes a language model directly on a dataset of pairwise comparisons between responses, eliminating the need to train a separate reward model or use reinforcement learning (RL). It exploits the closed-form relationship between a reward function and its optimal policy under a KL-constrained objective: by reparameterizing the reward in terms of the policy itself, the Bradley-Terry preference likelihood reduces to a simple classification loss that raises the likelihood of preferred responses relative to dispreferred ones. This makes DPO more stable and computationally efficient than RL-based methods such as Proximal Policy Optimization (PPO).
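The classification loss described above can be sketched in a few lines. The snippet below is a minimal illustration, not a full training loop: it assumes the per-sequence log-probabilities of the chosen and rejected responses under the current policy and under a frozen reference model have already been computed (the function name `dpo_loss` and the tensor arguments are illustrative, and `beta` is the temperature controlling the implicit KL penalty).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-sequence log-probabilities (sketch).

    Each argument has shape (batch,); the reference log-probs come from
    a frozen copy of the model and receive no gradient.
    """
    # Log-ratios of policy to reference for each response in the pair.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin, scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)
    # Binary classification loss: push the margin positive.
    return -F.logsigmoid(logits).mean()
```

When the policy and reference agree on both responses the margin is zero and the loss equals log 2; as the policy shifts probability mass toward the preferred response relative to the reference, the loss falls toward zero.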
