Online Preference Learning is a dynamic alignment approach in which a model's policy is updated continuously on fresh preference data collected from its most recent interactions, allowing it to adapt to new feedback in real time. This creates a closed-loop system: the model's outputs generate new data for preference annotation, which is immediately used for further training. It contrasts with offline preference learning, which trains on a static, pre-collected dataset. The core mechanism interleaves data collection, annotation, and policy updates in a single, ongoing training process.
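The interleaved collect-annotate-update loop can be sketched with a deliberately tiny toy: the "policy" is a softmax over a handful of candidate responses, the annotator is simulated by a hidden reward function, and each round performs a Bradley-Terry-style gradient step on a freshly sampled preference pair. Everything here (the candidate responses, the hidden rewards, the learning rate) is a hypothetical stand-in for an LLM policy, human labelers, and a DPO/RLHF-style optimizer; it is a minimal sketch of the closed loop, not a production recipe.

```python
import math
import random

random.seed(0)

# Hypothetical setup: four candidate "responses" and a hidden annotator
# reward for each (unknown to the policy).
responses = ["r0", "r1", "r2", "r3"]
true_reward = [0.1, 0.9, 0.3, 0.5]
theta = [0.0] * len(responses)   # policy logits
lr = 0.5                         # learning rate

def sample_pair():
    """Draw two distinct responses from the CURRENT softmax policy."""
    weights = [math.exp(t) for t in theta]
    i = random.choices(range(len(responses)), weights=weights)[0]
    j = random.choices(range(len(responses)), weights=weights)[0]
    while j == i:
        j = random.choices(range(len(responses)), weights=weights)[0]
    return i, j

def annotate(i, j):
    """Simulated annotator: prefers the response with higher hidden reward."""
    return (i, j) if true_reward[i] > true_reward[j] else (j, i)

# The online loop: collect -> annotate -> update, repeated continuously.
for step in range(2000):
    i, j = sample_pair()        # 1. fresh data from the latest policy
    win, lose = annotate(i, j)  # 2. immediate preference annotation
    # 3. Bradley-Terry-style update: raise P(win preferred over lose).
    p_win = 1.0 / (1.0 + math.exp(theta[lose] - theta[win]))
    theta[win] += lr * (1.0 - p_win)
    theta[lose] -= lr * (1.0 - p_win)

best = max(range(len(responses)), key=lambda k: theta[k])
print(responses[best])  # the policy concentrates on the highest-reward response
```

Because each preference pair is sampled from the policy being trained, the data distribution tracks the model's current behavior; in an offline setting, by contrast, the pairs would come from a fixed dataset gathered before training began.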
