Kahneman-Tversky Optimization (KTO) is a machine learning algorithm for aligning language models that directly optimizes a policy using a loss function derived from prospect theory. Unlike methods like Direct Preference Optimization (DPO) that require explicit pairwise comparisons, KTO uses a simpler binary signal—whether a single response is desirable or undesirable—and models the perceived gain or loss relative to a reference point. This makes it more data-efficient and robust to noisy or imbalanced preference labels.
Kahneman-Tversky Optimization (KTO)

What is Kahneman-Tversky Optimization (KTO)?
Kahneman-Tversky Optimization (KTO) is a preference optimization algorithm for language models that uses a loss function based on prospect theory from behavioral economics, focusing on deviations from a reference point rather than strict pairwise comparisons.
The algorithm's core innovation is framing alignment as a value-relative-to-a-reference problem rather than a comparison of two responses. It treats a desirable response as a gain and an undesirable one as a loss, and passes both through a non-linear value function modeled on prospect theory. In prospect theory, losses weigh more heavily than equivalent gains; KTO exposes this asymmetry as separate weights on desirable and undesirable examples, so the model can be penalized for harmful outputs more strongly than it is rewarded for good ones. KTO also eliminates the separate reward model and the reinforcement learning loop used by methods such as Proximal Policy Optimization (PPO), which simplifies the alignment pipeline while maintaining strong performance on benchmarks for helpfulness and harmlessness.
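The gain/loss framing above can be made concrete with a small sketch. The function below follows the general shape of the per-example loss in the published KTO formulation, under simplifying assumptions: log-probabilities are passed in as plain floats, the reference point `z0` is supplied by the caller rather than estimated from a batch, and the names `kto_example_loss`, `lambda_d`, and `lambda_u` are illustrative, not taken from any library.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_example_loss(
    policy_logp: float,   # log-probability of the response under the policy
    ref_logp: float,      # log-probability of the same response under the reference model
    desirable: bool,      # binary label: True = desirable, False = undesirable
    z0: float,            # reference point (assumed to be given here)
    beta: float = 0.1,    # sharpness of the value function (illustrative default)
    lambda_d: float = 1.0,  # weight on desirable examples (illustrative default)
    lambda_u: float = 1.0,  # weight on undesirable examples (illustrative default)
) -> float:
    # Implicit reward: how much more likely the policy makes this response
    # than the reference model does.
    reward = policy_logp - ref_logp
    if desirable:
        # Gain: value grows as the reward rises above the reference point.
        value = lambda_d * sigmoid(beta * (reward - z0))
        return lambda_d - value
    # Loss: value grows as the reward falls below the reference point.
    value = lambda_u * sigmoid(beta * (z0 - reward))
    return lambda_u - value
```

Raising `lambda_u` above `lambda_d` is one way to express loss aversion in this sketch: an undesirable response then contributes more to the loss than a desirable response with the same reward.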
KTO vs. DPO vs. RLHF: Key Differences
A technical comparison of three core algorithms used to align language models with human or AI preferences, highlighting their underlying mechanisms, data requirements, and computational trade-offs.
| Feature / Mechanism | Kahneman-Tversky Optimization (KTO) | Direct Preference Optimization (DPO) | Reinforcement Learning from Human Feedback (RLHF) |
|---|---|---|---|
| Core Theoretical Basis | Prospect Theory (Kahneman & Tversky) | Bradley-Terry Model & Plackett-Luce | Reinforcement Learning (Policy Gradients) |
| Required Training Data Format | Single responses labeled as 'desirable' or 'undesirable' | Strict pairwise comparisons (Chosen vs. Rejected) | Pairwise comparisons for reward model + generations for RL |
| Explicit Reward Model Required | No | No | Yes |
| Reinforcement Learning Loop | No | No | Yes |
| Primary Loss Function | KTO loss (asymmetric, reference-dependent) | DPO loss (implicit reward via Bradley-Terry) | Combined PPO loss + KL penalty (explicit reward) |
| Key Hyperparameter | Beta, plus weights on desirable and undesirable examples (reference point is implicit in the loss) | Beta (controls deviation from reference policy) | Beta (KL penalty) + multiple PPO/LR hyperparameters |
| Training Stability | High (single-stage, no RL instability) | High (single-stage, no RL instability) | Moderate to Low (two-stage, RL instability risk) |
| Computational Complexity | Low (similar to supervised fine-tuning) | Low (similar to supervised fine-tuning) | High (requires reward model training + intensive PPO rollouts) |
| Handles Non-Binary Preferences | | | |
| Mitigates Reward Overoptimization | Yes (via loss asymmetry & reference point) | Yes (via implicit reward & KL constraint) | Yes (via explicit KL penalty, but risk remains) |
| Typical Use Case | Aligning with simple good/bad feedback; data-efficient tuning | Standard pairwise preference alignment; simplicity & stability | High-resource, maximal performance alignment with complex rewards |
Frequently Asked Questions
Kahneman-Tversky Optimization (KTO) is a machine learning algorithm for aligning language models with human or AI preferences that uses a loss function derived from prospect theory, a cornerstone of behavioral economics developed by Daniel Kahneman and Amos Tversky. Unlike methods like Direct Preference Optimization (DPO) that require explicit pairwise comparisons, KTO trains on binary, per-example labels ('desirable' or 'undesirable'). It frames the learning objective around gains and losses relative to a reference point, which in the published formulation is an estimate of how far the policy has drifted from the reference model (a KL divergence term). This can make it more data-efficient, because it does not require two responses to the same prompt to be ranked against each other, and it optimizes for the utility of each response rather than only its relative ranking.
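The reference point mentioned above can also be sketched in code. The version below assumes the batch-level approximation described in the KTO paper, as best understood here: responses are paired with prompts they were not written for, the policy-versus-reference log-ratio is averaged over those mismatched pairs, and the result is clamped at zero. The helper names `log_ratio` and `estimate_reference_point` are hypothetical.

```python
from statistics import mean


def log_ratio(policy_logp: float, ref_logp: float) -> float:
    """Log of (policy probability / reference probability) for one response."""
    return policy_logp - ref_logp


def estimate_reference_point(mismatched_log_ratios: list[float]) -> float:
    """Batch-level estimate of the reference point.

    Takes log-ratios computed on mismatched (prompt, response) pairs and
    returns their mean, clamped at zero because a KL divergence is non-negative.
    """
    if not mismatched_log_ratios:
        raise ValueError("need at least one mismatched pair")
    return max(0.0, mean(mismatched_log_ratios))


# Illustrative numbers only: three mismatched pairs from one batch.
ratios = [
    log_ratio(-12.0, -12.5),
    log_ratio(-8.0, -7.0),
    log_ratio(-9.0, -10.0),
]
z0 = estimate_reference_point(ratios)
```

In a training loop this estimate would normally be treated as a constant for the batch, so that gradients flow only through the reward of the example being scored.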
Related Terms
Kahneman-Tversky Optimization (KTO) is a key technique within the broader field of aligning AI models using feedback. The following terms are essential for understanding its context, mechanisms, and alternatives.
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a foundational preference optimization algorithm that directly fine-tunes a language model's policy using pairwise comparison data, bypassing the need for an explicit reward model. It derives its loss function from the Bradley-Terry model of preferences.
- Contrast with KTO: While DPO requires explicit pairs of chosen and rejected responses, KTO uses a simpler binary signal (each response is labeled desirable or undesirable on its own) and incorporates a reference point from prospect theory, making it more robust to imbalanced or noisy preference data.
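For comparison, the pairwise DPO loss can be sketched in a few lines. This is a minimal illustration of the Bradley-Terry-based objective for a single (chosen, rejected) pair, assuming sequence log-probabilities are already available as floats; the function names are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # controls deviation from the reference policy
) -> float:
    # Implicit rewards are log-ratios between policy and reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # The loss depends only on the margin between the two responses,
    # which is why DPO needs both halves of every pair.
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))
```

Note that both responses to the same prompt enter one loss term here, whereas a KTO-style loss scores each response separately against a reference point.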
Reinforcement Learning from AI Feedback (RLAIF)
Reinforcement Learning from AI Feedback (RLAIF) is the overarching paradigm where an AI model (like a large language model) generates the preference or reward signals used to train another model. This scales alignment by reducing reliance on costly human annotation.
- KTO's Role: KTO is a specific optimization algorithm that can operate within an RLAIF pipeline. It uses AI-generated preference judgments to calculate its prospect theory-based loss, aligning the target model without a complex reinforcement learning loop.
Reward Modeling
Reward modeling is the process of training a separate neural network to predict a scalar reward value, typically based on human or AI preferences. This reward model is then used to guide policy optimization via algorithms like Proximal Policy Optimization (PPO).
- KTO's Approach: KTO eliminates the need for training and maintaining a separate reward model. It directly incorporates the preference logic into its loss function, simplifying the alignment stack and avoiding issues like reward overoptimization that can occur when a policy overfits to an imperfect reward model.
Prospect Theory
Prospect Theory, developed by Daniel Kahneman and Amos Tversky, is a behavioral economics model describing how people make decisions under risk. It posits that people evaluate potential losses and gains relative to a reference point, and that losses loom larger than equivalent gains (loss aversion).
- Core of KTO: The KTO loss function is modeled on prospect theory. It treats a model's output as a 'gain' if it is labeled desirable and a 'loss' if it is labeled undesirable, measured against a reference point that reflects how far the policy has moved from a reference model. This psychological grounding is what differentiates it from purely statistical approaches like DPO.
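Loss aversion is easiest to see in the classic prospect theory value function. The sketch below uses the power-function form with the median parameter estimates reported by Tversky and Kahneman (curvature 0.88, loss aversion 2.25); the function name is illustrative. KTO itself, as published, replaces this power curve with a logistic function, so this block shows the behavioral model rather than the training loss.

```python
def prospect_value(z: float, alpha: float = 0.88, loss_aversion: float = 2.25) -> float:
    """Subjective value of an outcome z measured relative to the reference point (z = 0)."""
    if z >= 0:
        # Gains: concave, so each extra unit of gain matters a little less.
        return z ** alpha
    # Losses: convex and scaled up, so a loss hurts more than an equal gain pleases.
    return -loss_aversion * ((-z) ** alpha)


gain = prospect_value(100.0)
loss = prospect_value(-100.0)
# The loss is larger in magnitude than the equal-sized gain.
print(gain, loss)
```

With these parameters a loss of a given size is valued at 2.25 times the magnitude of a gain of the same size.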
Preference Dataset
A preference dataset is the curated collection of prompts, model-generated responses, and annotations (human or AI) indicating which response is preferred. It is the fundamental fuel for alignment techniques like DPO, reward modeling, and KTO.
- Data Requirements for KTO: KTO works with unpaired binary feedback, where each response is labeled desirable or undesirable on its own. It does not require the paired 'chosen vs. rejected' format needed by DPO, and the desirable and undesirable examples need not share prompts or be balanced in number. This can reduce data collection complexity. Datasets may include synthetic preferences generated by AI judges.
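The difference between the two data formats can be shown with a small conversion sketch. It assumes a paired record has `prompt`, `chosen`, and `rejected` fields and an unpaired record has `prompt`, `completion`, and `label` fields; these field names and the `unpair` helper are illustrative choices for this example.

```python
from typing import TypedDict


class PairedExample(TypedDict):
    prompt: str
    chosen: str
    rejected: str


class UnpairedExample(TypedDict):
    prompt: str
    completion: str
    label: bool  # True = desirable, False = undesirable


def unpair(pairs: list[PairedExample]) -> list[UnpairedExample]:
    """Split each (chosen, rejected) pair into two independently labeled examples."""
    rows: list[UnpairedExample] = []
    for pair in pairs:
        rows.append({"prompt": pair["prompt"], "completion": pair["chosen"], "label": True})
        rows.append({"prompt": pair["prompt"], "completion": pair["rejected"], "label": False})
    return rows


paired = [{"prompt": "Summarize the report.", "chosen": "A short, accurate summary.", "rejected": "An off-topic reply."}]
unpaired = unpair(paired)
```

The conversion only goes one way for free: unpaired feedback such as thumbs-up or thumbs-down logs can be used in this format directly, but cannot generally be turned back into pairs without two rated responses to the same prompt.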
Alignment Tax
Alignment tax refers to the potential degradation of a model's general capabilities (e.g., reasoning diversity, factual knowledge) that can occur as a side effect of alignment procedures aimed at improving safety or helpfulness.
- KTO's Consideration: Methods like KTO are often discussed as a way to limit the alignment tax. Because its loss is anchored to a reference model, KTO aims to achieve effective alignment while better preserving the base model's performance, compared to more aggressive reinforcement learning fine-tuning methods.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.