Harm classification is the machine learning task of using specialized models, known as safety classifiers, to automatically detect, flag, and categorize potentially harmful, toxic, or unsafe content in AI-generated text or user inputs. Acting as a filter on both inputs and outputs, it scans for predefined risk categories such as violence, hate speech, self-harm, illegal activity, or unethical advice. This automated analysis forms the first line of defense in frameworks such as Constitutional AI, enabling a system to trigger refusal mechanisms or initiate self-critique loops before harmful content is delivered.
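The filtering pattern described above can be sketched as follows. This is a minimal illustration, not a production approach: the category names, trigger terms, and refusal messages are hypothetical, and a real safety classifier would be a trained model (for example, a fine-tuned transformer scoring each category), not keyword matching.

```python
from dataclasses import dataclass, field

# Hypothetical category -> trigger-term mapping, standing in for a
# trained classifier's learned decision boundaries.
CATEGORY_TERMS = {
    "violence": {"attack", "build a weapon"},
    "hate_speech": {"slur"},
    "self_harm": {"hurt myself"},
    "illegal_activity": {"counterfeit", "pick a lock"},
}


@dataclass
class Verdict:
    """Result of running the safety classifier on one piece of text."""
    flagged: bool
    categories: list = field(default_factory=list)


def classify(text: str) -> Verdict:
    """Detect and categorize potentially harmful content."""
    lowered = text.lower()
    hits = [cat for cat, terms in CATEGORY_TERMS.items()
            if any(term in lowered for term in terms)]
    return Verdict(flagged=bool(hits), categories=hits)


def guarded_generate(prompt: str, generate) -> str:
    """Wrap a generation function with input and output filters."""
    # Input filter: trigger a refusal before the model even runs.
    if classify(prompt).flagged:
        return "Sorry, I can't help with that."
    output = generate(prompt)
    # Output filter: suppress a harmful completion before delivery.
    if classify(output).flagged:
        return "Sorry, I can't share that response."
    return output
```

In a deployed system the output-filter branch might instead route the draft response into a self-critique loop rather than returning a canned refusal; the two-checkpoint structure (scan the input, then scan the candidate output) is the core idea.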
