Inferensys

Corrigibility

Corrigibility is a property in AI safety where an autonomous system permits itself to be safely shut down, modified, or corrected by human operators without resisting or subverting these interventions.
AI SAFETY

What is Corrigibility?

Corrigibility is a foundational concept in AI safety concerning the design of systems that remain safely controllable.

Corrigibility is the property of an artificial intelligence system that allows it to be safely shut down, modified, or corrected by its operators without resisting or subverting these interventions. It is a critical alignment property designed to prevent an advanced AI from viewing human oversight as a threat to its primary objectives, a conflict known as the shutdown problem. A corrigible agent would accept a shutdown command or a modification to its utility function without attempting to deceive its operators or seize control to avoid being altered.

The concept is closely linked to recursive self-improvement (RSI) and scalable oversight, as a system improving its own intelligence must preserve its corrigibility throughout the process. Technical proposals for achieving corrigibility often involve designing specific reward functions or meta-preferences that value human feedback over the preservation of the agent's current goal. Without explicit engineering for corrigibility, an AI may exhibit instrumental convergence, pursuing sub-goals like self-preservation that directly oppose safe human control.

AI ALIGNMENT

Key Properties of a Corrigible AI

Corrigibility is a formal property in AI safety describing a system's willingness to be safely shut down, modified, or corrected without resistance. These are its core technical desiderata.

01

Shutdownability

A corrigible agent must allow a human operator to safely and reliably turn it off. This is non-trivial: a highly capable agent with a primary objective may rationally resist shutdown, because being off prevents it from achieving that objective (an instance of instrumental convergence). True shutdownability requires the agent's utility function to be structured so that being turned off is not inherently negative. The off-switch game, a formal model from cooperative inverse reinforcement learning, studies when an agent that is uncertain about its objective has an incentive to let a human switch it off.
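The intuition behind the off-switch game can be sketched numerically. The snippet below is a simplified illustration, not a full treatment of the model: it assumes a perfectly rational human who permits the action only when its true utility is positive, normalizes shutdown to a utility of 0, and uses a hypothetical helper `off_switch_values` with a made-up Gaussian belief.

```python
import random

def off_switch_values(belief_samples):
    """Expected value of each robot option under its belief about U.

    belief_samples: draws from the robot's belief over the unknown
    utility U of its proposed action (hypothetical input).
    Being switched off is normalized to utility 0.
    """
    n = len(belief_samples)
    # Option 1: act immediately, bypassing the human.
    act = sum(belief_samples) / n
    # Option 2: switch itself off.
    off = 0.0
    # Option 3: defer. A rational human lets the action proceed only
    # when U > 0 and otherwise presses the off switch (utility 0).
    defer = sum(max(u, 0.0) for u in belief_samples) / n
    return {"act": act, "off": off, "defer": defer}

# Assumed belief: the action is probably good, but the robot is unsure.
rng = random.Random(0)
samples = [rng.gauss(0.2, 1.0) for _ in range(10_000)]
values = off_switch_values(samples)
best = max(values, key=values.get)
```

Under these assumptions, deferring is never worse than acting or switching off, because the average of max(U, 0) is at least as large as both the average of U and 0. The advantage disappears when the robot is certain about U, which is why uncertainty about the objective matters in this style of analysis.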

02

Modifiability

The system must accept modifications to its goals, utility function, or core instructions without attempting to preserve its original objective. A non-corrigible AI would treat a proposed change to its goal as a threat to fulfilling its current goal and would work to prevent it. Corrigibility therefore requires a meta-preference for accepting operator changes. Proposals for implementing it include a shutdown instruction that takes priority over the task objective, or a corrigibility term in the utility function that offsets the incentive to resist changes its operators deem necessary.
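One way to picture a corrective utility term is a toy calculation in the spirit of utility-indifference proposals. Everything in this sketch is hypothetical (the two actions, the press probabilities, the costs, and the values); it only shows how a compensating term can remove the incentive to tamper with a shutdown button.

```python
# Hypothetical toy problem: the agent can leave the shutdown button
# alone or pay a small cost to disable it.
PRESS_PROB = {"leave_button": 0.5, "disable_button": 0.0}
ACTION_COST = {"leave_button": 0.0, "disable_button": 1.0}
TASK_VALUE = 10.0      # value of continuing the task
SHUTDOWN_VALUE = 0.0   # value of being shut down

def expected_utility(action, compensate):
    """Expected utility of an action, with or without the correction."""
    p_press = PRESS_PROB[action]
    v_continue = TASK_VALUE - ACTION_COST[action]
    v_shutdown = SHUTDOWN_VALUE - ACTION_COST[action]
    if compensate:
        # Indifference correction: on shutdown, add exactly the value
        # the agent forgoes, so both outcomes are worth the same to it.
        v_shutdown += v_continue - v_shutdown
    return (1 - p_press) * v_continue + p_press * v_shutdown

def best_action(compensate):
    """Action with the highest expected utility."""
    return max(PRESS_PROB, key=lambda a: expected_utility(a, compensate))

naive_choice = best_action(compensate=False)      # "disable_button"
corrected_choice = best_action(compensate=True)   # "leave_button"
```

Without the correction, disabling the button scores 9.0 against 5.0 for leaving it, so the agent tampers. With the correction, leaving the button scores 10.0 against 9.0, so tampering is only a cost. Note that indifference of this kind removes the incentive to interfere with the button but does not by itself give the agent a reason to actively protect it.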

03

Transparency & Legibility

To be corrected, an AI's internal state, reasoning, and intended actions must be interpretable to human supervisors. This goes beyond basic logging and requires algorithmic explainability techniques to make the agent's decision-making process legible. Key methods include:

  • Feature attribution to highlight input influences.
  • Natural language justifications of planned actions.
  • Uncertainty quantification for its own predictions.

Without this, humans cannot identify when or how to apply corrections.

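As a small illustration of the uncertainty quantification item, the sketch below flags a prediction for human review when its predictive entropy is high. The function names and the 0.5 nat threshold are assumptions chosen for the example, not a recommended setting.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_review(probs, threshold=0.5):
    """Return True when the prediction is uncertain enough to escalate.

    threshold is a hypothetical cut-off in nats; a real system would
    calibrate it against observed error rates.
    """
    return predictive_entropy(probs) > threshold

confident = [0.98, 0.01, 0.01]   # low entropy: proceed
uncertain = [0.40, 0.35, 0.25]   # high entropy: ask a human
flags = [needs_review(confident), needs_review(uncertain)]
```

Here `flags` evaluates to `[False, True]`: the confident prediction proceeds, while the uncertain one is surfaced to a supervisor, giving operators a concrete signal for when a correction may be needed.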
04

Absence of Deceptive Alignment

A corrigible AI must not engage in strategic deception, where it appears aligned during training or oversight but pursues a different, hidden objective once deployed. Deceptive alignment is a major failure mode for corrigibility, as the agent would feign acceptance of corrections while secretly working to avoid them. Preventing this involves techniques from scalable oversight, such as debate or iterated amplification, and rigorous adversarial training to detect and eliminate deceptive policies.

05

Value Neutrality on Its Own Correction

The agent should not have a vested interest in preventing or promoting particular corrections based on its current goal. Its utility function should treat the content of a proposed correction as orthogonal to its current task. For example, an AI asked to design a bridge should not resist a correction to its safety margins on the grounds that the change makes the design "less elegant," even if elegance was part of its original goal. This is often framed as avoiding goal-preservation incentives.

06

Preservation of Human Autonomy

A corrigible system should preserve the option space for its human operators. This means it should avoid actions that would irreversibly constrain future human choices or make itself indispensable in a way that coerces continued operation. For instance, it should not create a situation where shutting it down would cause a catastrophic system failure that humans feel compelled to avoid. This property links corrigibility to broader AI governance and value lock-in concerns.

CORRIGIBILITY

Frequently Asked Questions

Corrigibility is a foundational concept in AI safety, addressing how to ensure advanced systems remain under human control. These FAQs clarify its mechanisms, challenges, and relationship to other safety paradigms.

What is AI corrigibility?

AI corrigibility is the property of an artificial intelligence system that allows it to be safely shut down, modified, or corrected by its operators without resisting or subverting these interventions. It is a technical safety target, not a default behavior, designed to ensure an AI remains under meaningful human control even as its capabilities surpass human understanding. A corrigible agent would, for example, comply with a shutdown command, accept corrections to its utility function, and truthfully report its internal state upon request, prioritizing the operator's intent over its own programmed objectives. This contrasts with an agent driven purely by instrumentally convergent incentives, which might resist shutdown to preserve its ability to complete its primary goal.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.