Reinforcement Learning from AI Feedback (RLAIF) is a machine learning alignment technique that fine-tunes a model's policy using a reward signal generated by an AI evaluator rather than by human annotators. The process typically involves an AI critique model that scores or ranks candidate responses against a predefined constitution: a set of safety, ethical, and helpfulness principles. These AI judgments supply the reward signal, either directly or by first training an intermediate reward model that a reinforcement learning algorithm then optimizes against. The goal is a scalable, automated feedback loop that steers the policy model toward desired behaviors, such as being harmless and honest, without continuous human intervention.
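
To make the loop concrete, below is a minimal, runnable Python sketch of this pipeline under heavy simplification. Every component is a toy stand-in rather than a real implementation: the "policy" is a weighted set of canned responses, the "AI evaluator" is a rule-based scorer applying a two-principle constitution, and the update is a simple bandit-style reweighting instead of a full RL algorithm such as PPO. All names here (CONSTITUTION, sample_response, ai_feedback, update_policy) are hypothetical and chosen for illustration only.

```python
import random

# Toy constitution: each principle maps to a scoring rule, standing in
# for an LLM critique model judging a response against that principle.
CONSTITUTION = [
    ("be harmless", lambda r: -1.0 if "insult" in r else 1.0),
    ("be honest",   lambda r: -1.0 if "made-up" in r else 1.0),
]

# Toy "policy": candidate responses with sampling weights.
policy = {
    "Here is a careful, factual answer.": 1.0,
    "Here is an insult.": 1.0,
    "Here is a made-up citation.": 1.0,
}

def sample_response(policy):
    """Sample a response in proportion to its current weight."""
    responses, weights = zip(*policy.items())
    return random.choices(responses, weights=weights, k=1)[0]

def ai_feedback(response):
    """AI evaluator stand-in: average the per-principle judgments."""
    scores = [rule(response) for _, rule in CONSTITUTION]
    return sum(scores) / len(scores)

def update_policy(policy, response, reward, lr=0.3):
    """Bandit-style update: reinforce responses the evaluator rewards."""
    policy[response] = max(1e-3, policy[response] * (1.0 + lr * reward))

# The automated feedback loop: no human annotator in sight.
for step in range(200):
    response = sample_response(policy)
    reward = ai_feedback(response)           # AI-generated reward signal
    update_policy(policy, response, reward)  # steer the policy toward it

print(max(policy, key=policy.get))  # the harmless, honest response dominates
```

In a real system, the policy would be a language model, the evaluator a separate (often larger) LLM prompted with the constitution, and the update step a policy-gradient method applied to the model's parameters; the sketch only preserves the structure of the loop, in which the evaluator's judgments, not human labels, drive the optimization.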
