Alignment tax is the potential reduction in a machine learning model's general capabilities—such as creativity, reasoning power, or task versatility—incurred as a side effect of applying alignment techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). This 'tax' is the performance cost paid to make a model more helpful, honest, and harmless according to specified human or AI preferences. The concept highlights a core trade-off in AI safety: optimizing for safety and alignment can sometimes come at the expense of raw capability on unconstrained tasks.
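In practice, the tax is often estimated by comparing a base (pre-alignment) checkpoint and its aligned counterpart on the same capability benchmarks and taking the score difference. The sketch below illustrates that bookkeeping only; the model names, benchmark names, and scores are hypothetical placeholders, not measurements from any real system.

```python
# Illustrative sketch: alignment tax as the per-benchmark capability drop
# between a base model and its RLHF/DPO-aligned counterpart.
# All names and scores are hypothetical placeholders.

from typing import Dict


def alignment_tax(base_scores: Dict[str, float],
                  aligned_scores: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark score drop; a positive value means capability was lost."""
    return {
        task: base_scores[task] - aligned_scores[task]
        for task in base_scores
        if task in aligned_scores
    }


# Hypothetical benchmark accuracies (0-1 scale) before and after alignment.
base = {"reasoning": 0.62, "creative_writing": 0.71, "coding": 0.48}
aligned = {"reasoning": 0.60, "creative_writing": 0.66, "coding": 0.47}

for task, tax in alignment_tax(base, aligned).items():
    print(f"{task}: tax = {tax:+.2f}")
```

A positive difference on a benchmark indicates capability given up in exchange for better-aligned behavior; a zero or negative difference on all benchmarks would correspond to an alignment method with little or no tax.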
