An ensemble reward is a technique in reinforcement learning where the predictions of multiple, independently trained reward models are aggregated—typically by averaging or voting—to produce a final reward signal for policy optimization. This method mitigates the risks of overfitting, reward hacking, and single-model failures by leveraging the collective judgment of the ensemble, yielding a more calibrated and generalizable reward function. It is a form of model averaging applied specifically to the reward modeling component of alignment pipelines such as Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning from AI Feedback (RLAIF).
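The aggregation step can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the reward models here are hypothetical stand-in functions, and `ensemble_reward` simply scores a (prompt, response) pair with each model and combines the scores with a chosen aggregator (mean for averaging, or `min` for a more conservative signal that penalizes disagreement).

```python
from statistics import mean

def ensemble_reward(reward_models, prompt, response, aggregate=mean):
    """Score (prompt, response) with each model and aggregate the results."""
    scores = [model(prompt, response) for model in reward_models]
    return aggregate(scores)

# Hypothetical stand-ins for independently trained reward models;
# real ones would be learned networks scoring the same pair.
def length_rm(prompt, response):
    return min(len(response) / 100.0, 1.0)

def politeness_rm(prompt, response):
    return 1.0 if "thanks" in response.lower() else 0.2

def neutral_rm(prompt, response):
    return 0.5

models = [length_rm, politeness_rm, neutral_rm]
pair = ("Say hi", "Hello, thanks for asking!")

avg_reward = ensemble_reward(models, *pair, aggregate=mean)  # averaging
min_reward = ensemble_reward(models, *pair, aggregate=min)   # conservative
```

Using `min` as the aggregator is a common conservative choice: the policy only earns a high reward when every member of the ensemble agrees, which makes it harder to exploit idiosyncrasies of any single reward model.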
