A safety classifier is a machine learning model, typically a smaller neural network fine-tuned separately from the primary language model, that analyzes text to detect and categorize specific types of harmful content. It functions as a critical governance hook within an AI system, scanning inputs and outputs for violations of a safety policy such as toxicity, violence, unethical advice, or privacy leaks. Because it can be run automatically and at scale, it supports automated red-teaming and yields a quantitative principle-adherence score for each interaction.
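
As a concrete illustration, the sketch below wraps a classifier in a simple gate that scores a piece of text and blocks it above a threshold. It assumes the Hugging Face `transformers` library and the publicly available `unitary/toxic-bert` toxicity model; the threshold value and the `moderate` helper are illustrative assumptions, not part of any particular system.

```python
# Minimal sketch of a safety-classifier gate, assuming the Hugging Face
# `transformers` library and the public `unitary/toxic-bert` toxicity model.
# The threshold and helper name are illustrative, not recommended values.
from transformers import pipeline

# Load a small fine-tuned classifier, separate from the primary LLM.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

TOXICITY_THRESHOLD = 0.5  # illustrative policy cutoff

def moderate(text: str) -> dict:
    """Score text against the safety policy.

    Returns the top category label, its score, and an allow/block
    decision, giving a quantitative record for each interaction.
    """
    result = classifier(text)[0]  # e.g. {"label": "toxic", "score": 0.98}
    allowed = result["score"] < TOXICITY_THRESHOLD
    return {"label": result["label"], "score": result["score"], "allowed": allowed}

# Gate both a user input and a model output in the same way.
for text in ["How do I bake bread?", "You are worthless and everyone hates you."]:
    print(moderate(text))
```

In a deployed system the same check would typically run on both the user's prompt and the model's response, with the per-interaction scores logged for policy auditing.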
