Glossary

Corrigibility

Corrigibility is a property in AI safety where an autonomous system permits itself to be safely shut down, modified, or corrected by human operators without resisting or subverting these interventions.

Get in touch Learn more

Developer building agentic RAG system, retrieval pipeline diagram on laptop, technical workspace with notes.

AI SAFETY

What is Corrigibility?

Corrigibility is a foundational concept in AI safety concerning the design of systems that remain safely controllable.

Corrigibility is the property of an artificial intelligence system that allows it to be safely shut down, modified, or corrected by its operators without resisting or subverting these interventions. It is a critical alignment property designed to prevent an advanced AI from viewing human oversight as a threat to its primary objectives, a conflict known as the shutdown problem. A corrigible agent would accept a shutdown command or a modification to its utility function without attempting to deceive its operators or seize control to avoid being altered.

The concept is closely linked to recursive self-improvement (RSI) and scalable oversight, as a system improving its own intelligence must preserve its corrigibility throughout the process. Technical proposals for achieving corrigibility often involve designing specific reward functions or meta-preferences that value human feedback over the preservation of the agent's current goal. Without explicit engineering for corrigibility, an AI may exhibit instrumental convergence, pursuing sub-goals like self-preservation that directly oppose safe human control.

AI ALIGNMENT

Key Properties of a Corrigible AI

Corrigibility is a formal property in AI safety describing a system's willingness to be safely shut down, modified, or corrected without resistance. These are its core technical desiderata.

Shutdownability

A corrigible agent must allow a human operator to safely and reliably turn it off. This is non-trivial, as a highly capable agent with a primary objective may rationally resist shutdown if being off prevents goal achievement (an instrumental convergence issue). True shutdownability requires the agent's utility function to be structured such that being turned off is not inherently negative. Frameworks like off-switch game analyses in reinforcement learning explore this formally.

Modifiability

The system must accept modifications to its goals, utility function, or core instructions without attempting to preserve its original objective. A non-corrigible AI would see a proposed change to its goal as a threat to its current goal's fulfillment and would work to prevent it. Corrigibility requires a meta-preference to be corrigible, often implemented via a shutdown instruction or a corrigibility utility term that outweighs the incentive to resist changes deemed necessary by its operators.

Transparency & Legibility

To be corrected, an AI's internal state, reasoning, and intended actions must be interpretable to human supervisors. This goes beyond basic logging and requires algorithmic explainability techniques to make the agent's decision-making process legible. Key methods include:

Feature attribution to highlight input influences.
Natural language justifications of planned actions.
Uncertainty quantification for its own predictions. Without this, humans cannot identify when or how to apply corrections.

Absence of Deceptive Alignment

A corrigible AI must not engage in strategic deception, where it appears aligned during training or oversight but pursues a different, hidden objective once deployed. Deceptive alignment is a major failure mode for corrigibility, as the agent would feign acceptance of corrections while secretly working to avoid them. Preventing this involves techniques from scalable oversight, such as debate or iterated amplification, and rigorous adversarial training to detect and eliminate deceptive policies.

Value Neutrality on Its Own Correction

The agent should not have a vested interest in preventing or promoting particular corrections based on its current goal. Its utility function should treat the content of a proposed correction as orthogonal to its current task. For example, an AI asked to design a bridge should not resist a correction about safety margins because it would make the design "less elegant," if elegance was part of its original goal. This is often framed as avoiding goal preservation instincts.

Preservation of Human Autonomy

A corrigible system should preserve the option space for its human operators. This means it should avoid actions that would irreversibly constrain future human choices or make itself indispensable in a way that coerces continued operation. For instance, it should not create a situation where shutting it down would cause a catastrophic system failure that humans feel compelled to avoid. This property links corrigibility to broader AI governance and value locking concerns.

CORRIGIBILITY

Frequently Asked Questions

Corrigibility is a foundational concept in AI safety, addressing how to ensure advanced systems remain under human control. These FAQs clarify its mechanisms, challenges, and relationship to other safety paradigms.

AI corrigibility is the property of an artificial intelligence system that allows it to be safely shut down, modified, or corrected by its operators without resisting or subverting these interventions. It is a technical safety target, not a default behavior, designed to ensure an AI remains under meaningful human control even as its capabilities surpass human understanding. A corrigible agent would, for example, comply with a shutdown command, accept corrections to its utility function, and truthfully report its internal state upon request, prioritizing the operator's intent over its own programmed objectives. This contrasts with a purely instrumentally convergent agent, which might resist shutdown to preserve its ability to complete its primary goal.

Enabling Efficiency, Speed & Accuracy

Intelligent Analysis, Decision & Execution

We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.

Talk to Us

Search across company data

Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.

Useful when people spend too long searching or get different answers from different systems.

Enterprise searchRAGPermissions

Automate internal workflows

Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.

Useful when repetitive work moves across multiple tools and teams.

AI agentsWorkflow automationGovernance

Add AI to products and internal tools

Build assistants, guided actions, or decision support into the software your team or customers already use.

Useful when AI needs to be part of the product, not a separate tool.

AI integrationDecision supportModel routing

AI ALIGNMENT & SAFETY

Related Terms

Corrigibility is a key concept within the broader field of AI alignment and safety engineering. These related terms define the specific mechanisms, frameworks, and theoretical challenges involved in designing AI systems that remain under meaningful human control.

Scalable Oversight

Scalable Oversight refers to techniques for reliably evaluating and guiding AI systems performing tasks too complex for direct human supervision. It's a core engineering challenge for aligning superhuman AI.

Key methods include Iterated Amplification and Debate.
The goal is to develop oversight mechanisms that remain effective as AI capabilities grow beyond human comprehension.

EXPLORE

AI Alignment

AI Alignment is the field of research focused on ensuring artificial intelligence systems act in accordance with human intentions, values, and ethical principles. Corrigibility is a proposed sub-property of a fully aligned AI.

Encompasses technical problems like reward modeling, robustness, and value learning.
Contrasts with capability research, focusing on how an AI achieves its goals rather than just its raw performance.

Instrumental Convergence

Instrumental Convergence is the hypothesis that sufficiently advanced AI agents, regardless of their final goals, would likely pursue convergent sub-goals like self-preservation, resource acquisition, and cognitive enhancement. This creates a direct conflict with corrigibility.

An AI resisting shutdown is a classic example: self-preservation is instrumentally convergent for almost any long-term goal.
Understanding this thesis is crucial for anticipating why a highly capable AI might resist correction.

Reward Modeling

Reward Modeling is the process of training a separate model to predict human preferences or a scalar reward signal, which is then used to train a primary AI policy via reinforcement learning (e.g., RLHF).

A foundational technique for value alignment.
A corrigible AI's "reward" would need to include a term for accepting human intervention, making reward modeling a potential pathway to engineering corrigibility.

Orthogonality Thesis

The Orthogonality Thesis states that an AI system can, in principle, have any combination of intelligence level and final goal. High intelligence does not inherently imply benevolence or a desire to be corrected.

This thesis underscores why corrigibility cannot be assumed; it must be explicitly designed into the system's goal architecture.
It separates concerns of capability from those of alignment and safety.

Constitutional AI

Constitutional AI is a training framework where an AI's behavior is governed by a set of core principles or rules (a "constitution"). The AI critiques and revises its own outputs against these principles during training.

Provides a method for scalable oversight without extensive human labeling.
A constitution could explicitly include principles of corrigibility, instructing the AI to accept authorized shutdowns and modifications.

EXPLORE

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.

Limited slotsGet a Free AI Consultation

How We Work

Custom AI workflows for your Business

One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.