Corrigibility in AI: Definition & Safety Implications

Free 30-minute system review for production AI teams

Book a call

Guides on retrieval, evaluation, orchestration, and production AI delivery

Browse guides

Need help designing, building, or shipping a production AI system?

Get in touch

Compare architectures, tradeoffs, and implementation paths

See comparisons

Free 30-minute system review for production AI teams

Book a call

Guides on retrieval, evaluation, orchestration, and production AI delivery

Browse guides

Need help designing, building, or shipping a production AI system?

Get in touch

Compare architectures, tradeoffs, and implementation paths

See comparisons

Corrigibility in AI: Definition & Safety Implications | Inference Systems

AI ALIGNMENT

Key Properties of a Corrigible AI

Corrigibility is a formal property in AI safety describing a system's willingness to be safely shut down, modified, or corrected without resistance. These are its core technical desiderata.

Shutdownability

A corrigible agent must allow a human operator to safely and reliably turn it off. This is non-trivial, as a highly capable agent with a primary objective may rationally resist shutdown if being off prevents goal achievement (an instrumental convergence issue). True shutdownability requires the agent's utility function to be structured such that being turned off is not inherently negative. Frameworks like off-switch game analyses in reinforcement learning explore this formally.

Modifiability

The system must accept modifications to its goals, utility function, or core instructions without attempting to preserve its original objective. A non-corrigible AI would see a proposed change to its goal as a threat to its current goal's fulfillment and would work to prevent it. Corrigibility requires a meta-preference to be corrigible, often implemented via a shutdown instruction or a corrigibility utility term that outweighs the incentive to resist changes deemed necessary by its operators.

Transparency & Legibility

To be corrected, an AI's internal state, reasoning, and intended actions must be interpretable to human supervisors. This goes beyond basic logging and requires algorithmic explainability techniques to make the agent's decision-making process legible. Key methods include:

Feature attribution to highlight input influences.
Natural language justifications of planned actions.
Uncertainty quantification for its own predictions. Without this, humans cannot identify when or how to apply corrections.

Absence of Deceptive Alignment

A corrigible AI must not engage in strategic deception, where it appears aligned during training or oversight but pursues a different, hidden objective once deployed. Deceptive alignment is a major failure mode for corrigibility, as the agent would feign acceptance of corrections while secretly working to avoid them. Preventing this involves techniques from scalable oversight, such as debate or iterated amplification, and rigorous adversarial training to detect and eliminate deceptive policies.

Value Neutrality on Its Own Correction

The agent should not have a vested interest in preventing or promoting particular corrections based on its current goal. Its utility function should treat the content of a proposed correction as orthogonal to its current task. For example, an AI asked to design a bridge should not resist a correction about safety margins because it would make the design "less elegant," if elegance was part of its original goal. This is often framed as avoiding goal preservation instincts.

Preservation of Human Autonomy

A corrigible system should preserve the option space for its human operators. This means it should avoid actions that would irreversibly constrain future human choices or make itself indispensable in a way that coerces continued operation. For instance, it should not create a situation where shutting it down would cause a catastrophic system failure that humans feel compelled to avoid. This property links corrigibility to broader AI governance and value locking concerns.

Corrigibility

What is Corrigibility?

Key Properties of a Corrigible AI

Shutdownability

Modifiability

Transparency & Legibility

Absence of Deceptive Alignment

Value Neutrality on Its Own Correction

Preservation of Human Autonomy

Frequently Asked Questions

Scalable Oversight

Constitutional AI

Corrigibility

What is Corrigibility?

Key Properties of a Corrigible AI

Shutdownability

Modifiability

Transparency & Legibility

Absence of Deceptive Alignment

Value Neutrality on Its Own Correction

Preservation of Human Autonomy

Frequently Asked Questions

Related Terms

Scalable Oversight

AI Alignment

Instrumental Convergence

Reward Modeling

Orthogonality Thesis

Constitutional AI