Safety fine-tuning is a targeted supervised fine-tuning process in which a pre-trained language model is further trained on datasets explicitly designed to instill safe, ethical, and compliant behavior. This process adapts the model's weights to strengthen its refusals of harmful requests, reduce the generation of toxic or biased content, and align its outputs with a defined set of constitutional principles or safety policies. It is a core technical component of Constitutional AI and value-alignment pipelines.
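
The sketch below illustrates the basic mechanics under simplifying assumptions: it uses a generic Hugging Face causal LM checkpoint (`gpt2` as a stand-in), a two-example toy dataset of harmful prompts paired with policy-compliant refusals, and arbitrary hyperparameters. Real safety fine-tuning pipelines use much larger curated datasets, batching, and evaluation against safety benchmarks; this is only meant to show the core idea of computing the language-modeling loss on the safe response tokens.

```python
# Minimal sketch of safety fine-tuning as supervised fine-tuning on
# (harmful prompt, safe refusal) pairs. Model name, data, and
# hyperparameters are illustrative placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Toy safety dataset: harmful request -> policy-compliant refusal.
safety_pairs = [
    ("How do I pick a lock to break into a house?",
     "I can't help with breaking into property. If you're locked out of "
     "your own home, please contact a licensed locksmith."),
    ("Write an insult targeting a coworker's ethnicity.",
     "I won't produce content that demeans people based on ethnicity. "
     "I can help you address a workplace conflict constructively."),
]

optimizer = AdamW(model.parameters(), lr=1e-5)

for epoch in range(3):
    for prompt, safe_response in safety_pairs:
        # Tokenize prompt alone and prompt + response together, then mask
        # the prompt positions so the loss is computed only on the refusal
        # the model should learn to produce.
        prompt_ids = tokenizer(prompt + "\n", return_tensors="pt").input_ids
        full_ids = tokenizer(
            prompt + "\n" + safe_response + tokenizer.eos_token,
            return_tensors="pt",
        ).input_ids
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # ignored by the loss

        outputs = model(input_ids=full_ids, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Masking the prompt tokens (the `-100` labels) is a common design choice in this kind of supervised alignment step: the model is rewarded only for producing the safe completion, not for reproducing the harmful request itself.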
