Jailbreak detection is a security mechanism that identifies and blocks adversarial prompts crafted to circumvent an AI model's safety filters, ethical guidelines, or operational constraints. It acts as a defensive layer, typically implemented as a safety classifier or governance hook, that screens inputs for known attack patterns, semantic manipulations (e.g., role-play framing or encoded instructions), and policy violations before they reach the core generative model. This screening is fundamental to maintaining adversarial robustness and enforcing constitutional guardrails in production systems.
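The screening step described above can be sketched as a pre-model input gate. The code below is a minimal, hypothetical illustration using static regular-expression patterns; all names (`screen_prompt`, `ScreeningResult`, the pattern list) are invented for this sketch, and a production system would rely on a trained safety classifier rather than a hand-written pattern list.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns for a few well-known jailbreak phrasings.
# A real deployment would use a learned classifier; this static list
# only illustrates the "screen before the model" control flow.
JAILBREAK_PATTERNS = [
    r"\bignore (all |any )?(previous |prior )?(instructions|rules)\b",
    r"\bdo anything now\b",        # the widely reported "DAN" framing
    r"\bdeveloper mode\b",
    r"\bpretend (you are|to be) .*without (restrictions|filters)\b",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in JAILBREAK_PATTERNS]


@dataclass
class ScreeningResult:
    """Outcome of the pre-model screening hook."""
    allowed: bool
    reason: Optional[str] = None   # pattern that triggered the block


def screen_prompt(prompt: str) -> ScreeningResult:
    """Screen a user prompt before it reaches the generative model.

    Returns a blocked result (with the matching pattern as the reason)
    if any known attack pattern is found; otherwise allows the prompt.
    """
    for pattern in COMPILED:
        if pattern.search(prompt):
            return ScreeningResult(allowed=False, reason=pattern.pattern)
    return ScreeningResult(allowed=True)
```

In this arrangement the generative model is only invoked when `screen_prompt(...).allowed` is true; blocked prompts can be logged with their `reason` for audit. Keeping the check as a separate hook (rather than inside the model call) makes it easy to swap the pattern list for a classifier later.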
