Cohen's Kappa (κ) is a chance-corrected metric that quantifies the level of agreement between two raters or classifiers on a categorical scale, ranging from -1 to 1. It is calculated as (observed agreement - expected agreement) / (1 - expected agreement). A value of 1 indicates perfect agreement, 0 indicates agreement equal to chance, and negative values suggest systematic disagreement. This correction for chance is its primary advantage over simple percent agreement, making it essential for inter-rater reliability assessment in fields like error classification and confusion matrix analysis.
Kappa Statistic (Cohen's Kappa)

What is Kappa Statistic (Cohen's Kappa)?
Cohen's Kappa is a robust statistical measure used to evaluate the agreement between two raters or classification systems, accounting for the agreement expected by chance.
In error detection and classification, Kappa is used to validate the consistency of human annotators or automated systems in categorizing failures, such as distinguishing between Type I and Type II errors or different failure modes. It is a cornerstone of evaluation-driven development, providing a quantitative basis for assessing classifier performance before deployment. When monitoring agentic self-evaluation outputs, a high Kappa score between an agent's self-assessment and a human validator indicates reliable confidence scoring and internal error detection mechanisms.
Interpreting Kappa Values: A Guide
Cohen's Kappa is a robust metric for measuring inter-rater agreement on categorical items, correcting for the agreement expected by chance. This guide explains its calculation, interpretation, and role in evaluating classification systems.
The Core Formula
Cohen's Kappa (κ) is calculated as:
κ = (p₀ - pₑ) / (1 - pₑ)
Where:
- p₀ is the observed proportion of agreement (the raw accuracy).
- pₑ is the expected proportion of agreement due to chance, calculated from the marginal totals of the confusion matrix.
This adjustment for chance is what distinguishes Kappa from simple accuracy, making it a more rigorous measure, especially for imbalanced classes.
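The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `cohen_kappa` and the sample labels are hypothetical, and it assumes both raters labeled the same items in the same order.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 8 of 10 items agree, each rater marks 3 items as 'err'.
a = ["err", "err", "ok", "ok", "ok", "ok", "err", "ok", "ok", "ok"]
b = ["err", "ok", "ok", "ok", "ok", "ok", "err", "ok", "err", "ok"]
print(round(cohen_kappa(a, b), 3))  # 0.524
```

Here observed agreement is 0.80 and chance agreement is 0.58, so κ = 0.22 / 0.42.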
Interpretation Scale (Landis & Koch)
A widely cited benchmark for interpreting Kappa values is the scale proposed by Landis and Koch (1977):
- κ < 0.00: Poor agreement
- 0.00 ≤ κ ≤ 0.20: Slight agreement
- 0.20 < κ ≤ 0.40: Fair agreement
- 0.40 < κ ≤ 0.60: Moderate agreement
- 0.60 < κ ≤ 0.80: Substantial agreement
- 0.80 < κ ≤ 1.00: Almost Perfect agreement
Note: Context matters. 'Substantial' agreement (κ > 0.6) is often the minimum threshold for reliable classification in many applied settings.
Kappa vs. Simple Accuracy
Simple accuracy can be misleading, especially with class imbalance. Kappa provides a critical correction.
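Example: A classifier labels 100 items whose reference labels are 90 of Class A and 10 of Class B. If it blindly calls everything 'Class A', its accuracy is 90%, but its Kappa is 0.0, because the agreement expected by chance is also 90%. Kappa correctly indicates no real agreement beyond chance. (If two raters both labeled every item 'Class A', observed and expected agreement would both equal 1 and Kappa would be undefined.)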
Key Insight: High accuracy with low Kappa signals that the model or raters are exploiting a class imbalance, not demonstrating true discriminative power.
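The gap between accuracy and Kappa under imbalance can be checked numerically. The sketch below assumes one label list is a reference set with 90 'A' and 10 'B' items and the other comes from a predictor that always answers 'A'. The helper name `agreement_and_kappa` is hypothetical.

```python
from collections import Counter

def agreement_and_kappa(labels_x, labels_y):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    n = len(labels_x)
    p_o = sum(x == y for x, y in zip(labels_x, labels_y)) / n
    count_x, count_y = Counter(labels_x), Counter(labels_y)
    # Chance agreement from the two marginal label distributions.
    p_e = sum(count_x[k] * count_y[k] for k in count_x) / (n * n)
    return p_o, (p_o - p_e) / (1 - p_e)

reference = ["A"] * 90 + ["B"] * 10   # imbalanced reference labels
always_a = ["A"] * 100                # predictor that ignores the input

accuracy, kappa = agreement_and_kappa(reference, always_a)
print(accuracy, kappa)  # 0.9 0.0
```

Chance agreement here is 0.9 × 1.0 + 0.1 × 0.0 = 0.9, exactly the observed agreement, so the numerator of κ is zero.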
Weighted Kappa for Ordinal Data
Standard Cohen's Kappa treats all disagreements equally. Weighted Kappa is used for ordinal categories (e.g., ratings of 'Poor', 'Fair', 'Good', 'Excellent'), where some disagreements are more serious than others.
- It applies a weight matrix (often linear or quadratic) to penalize larger discrepancies more heavily.
- A disagreement between 'Poor' and 'Excellent' receives a greater penalty than between 'Fair' and 'Good'.
- This provides a more nuanced measure of agreement for ranking and severity scales.
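A minimal sketch of weighted Kappa follows, assuming the categories are passed in rank order, at least two categories exist, and the raters do not both use a single identical category. The function name and the sample ratings are hypothetical. Disagreement weights are |i - j| / (k - 1) for linear weighting and the square of that for quadratic weighting.

```python
def weighted_kappa(rater_a, rater_b, categories, weighting="quadratic"):
    """Weighted Cohen's kappa; `categories` must list the ordinal levels in order."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # observed[i][j]: proportion of items rated categories[i] by A and categories[j] by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    power = 2 if weighting == "quadratic" else 1
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal, 1 at the extremes
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]
    return 1 - weighted_observed / weighted_expected

scale = ["Poor", "Fair", "Good", "Excellent"]
rater_a = ["Poor", "Fair", "Good", "Excellent", "Good", "Fair"]
near_miss = ["Poor", "Fair", "Good", "Good", "Good", "Fair"]  # one adjacent disagreement
far_miss = ["Poor", "Fair", "Good", "Poor", "Good", "Fair"]   # one extreme disagreement

print(weighted_kappa(rater_a, near_miss, scale))
print(weighted_kappa(rater_a, far_miss, scale))
```

Both comparison lists disagree with `rater_a` on exactly one item, yet the adjacent disagreement yields the higher weighted Kappa because its penalty weight is smaller.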
Common Pitfalls & Limitations
While powerful, Kappa has limitations to consider:
- Prevalence Effect: Kappa values are influenced by the distribution of categories (prevalence). The same observed agreement can yield different Kappa scores under different marginal distributions.
- Bias Effect: Kappa is also affected by any systematic bias between raters.
- Not a Panacea: A high Kappa indicates reliability, not validity. Raters can reliably agree on an incorrect label.
- Benchmark Dependency: The interpretation scales (like Landis & Koch) are guidelines, not universal statistical truths. Domain-specific thresholds are often necessary.
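The prevalence effect is easy to reproduce. The sketch below uses two hypothetical 2×2 agreement tables with the same observed agreement (90 of 100 items) but different marginal distributions. The function name `kappa_from_table` is an assumption made for illustration.

```python
def kappa_from_table(table):
    """Return (p_o, kappa) from a square table where table[i][j] counts items
    labeled category i by rater A and category j by rater B."""
    k = len(table)
    n = sum(sum(r) for r in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row[i] * col[i] for i in range(k)) / (n * n)
    return p_o, (p_o - p_e) / (1 - p_e)

balanced = [[45, 5], [5, 45]]  # each category used about half the time
skewed = [[85, 5], [5, 5]]     # one category dominates

p_o_bal, kappa_bal = kappa_from_table(balanced)
p_o_skew, kappa_skew = kappa_from_table(skewed)
print(p_o_bal, round(kappa_bal, 3))    # 0.9 0.8
print(p_o_skew, round(kappa_skew, 3))  # 0.9 0.444
```

With balanced marginals the chance agreement is 0.50. With skewed marginals it rises to 0.82, which leaves less room above chance and pulls κ down even though the raters agree on the same number of items.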
Application in Agentic Systems
In recursive error correction and agentic self-evaluation, Kappa serves as a key metric for:
- Evaluating Self-Critique: Measuring agreement between an agent's initial output and its own refined output after a correction cycle.
- Agent Consensus: Assessing agreement between multiple agents in a multi-agent system on the classification of a task outcome or error type.
- Validation Pipeline Benchmarking: Quantifying the reliability of automated output validation frameworks or hallucination detection modules against human auditor ground truth.
It provides a statistically grounded measure of improvement in autonomous iterative refinement.
Kappa vs. Other Agreement & Classification Metrics
This table compares Cohen's Kappa to other key metrics used for evaluating agreement between raters and the performance of classification models, highlighting their primary use cases, key properties, and limitations.
| Metric | Cohen's Kappa | Accuracy | F1 Score | Intraclass Correlation Coefficient (ICC) |
|---|---|---|---|---|
| Primary Purpose | Measures inter-rater agreement for categorical items, correcting for chance. | Measures the proportion of total predictions (both positive and negative) that were correct. | Balances precision and recall for binary classification, using the harmonic mean. | Measures reliability or agreement for continuous or ordinal data from multiple raters or measurements. |
| Chance Correction | Yes | No | No | |
| Handles Class Imbalance | Moderate (affected by prevalence) | Poor (misleading with imbalanced classes) | Strong (designed for imbalance) | Strong (model-dependent) |
| Scale / Output | Scalar value typically between -1 and 1. | Scalar value between 0 and 1. | Scalar value between 0 and 1. | Scalar value typically between 0 and 1. |
| Data Type | Categorical (nominal or ordinal). | Categorical. | Categorical (typically binary). | Continuous or ordinal. |
| Interpretation of 0 | Agreement equivalent to chance. | All predictions are incorrect. | Either precision or recall is zero. | No reliability among raters. |
| Key Limitation | Sensitive to prevalence; high agreement can yield low kappa if one category is very common. | Misleading with imbalanced classes; high accuracy can be achieved by always predicting the majority class. | Designed for binary classification; macro/micro averages needed for multi-class. | Multiple formulations (ICC1, ICC2, ICC3) with different assumptions about rater effects. |
| Common Use Case | Assessing reliability of human annotators (e.g., medical diagnosis, content moderation). | Initial, high-level assessment of model performance on balanced datasets. | Evaluating classifiers where both false positives and false negatives are important (e.g., spam detection). | Assessing consistency of measurements (e.g., medical device readings, psychological test scores). |
Key Use Cases for Cohen's Kappa
Cohen's Kappa (κ) is a robust metric for measuring agreement between two raters on categorical data, correcting for chance agreement. Its primary applications are in fields where subjective human judgment must be quantified and validated.
Validating Annotation Quality
In supervised machine learning, Cohen's Kappa is the gold standard for assessing the reliability of labeled training data. Before model training, data scientists calculate κ between multiple human annotators to ensure label consistency.
- A κ > 0.8 indicates excellent agreement, validating the dataset for training.
- A κ between 0.6 and 0.8 suggests substantial agreement, often requiring adjudication for disputed labels.
- Low κ scores signal poor annotation guidelines, necessitating retraining of annotators and refinement of the labeling protocol.
Benchmarking Diagnostic Tests
In medical and psychological diagnostics, κ is used to evaluate the agreement between a new screening tool and an established clinical gold standard, or between clinicians interpreting the same test results.
- For a new AI-based diagnostic tool analyzing medical images, κ quantifies how well its classifications align with expert radiologists.
- This application measures agreement with the reference standard rather than validity in the strict sense. It can support regulatory submissions because it accounts for the likelihood of agreement by chance in binary or multi-class diagnoses.
Evaluating Model vs. Human Performance
Kappa is used to compare the categorical outputs of a machine learning model against human expert judgments, providing a more nuanced view than simple accuracy.
- This is critical in content moderation, sentiment analysis, and medical coding, where the 'ground truth' is often subjective.
- A high κ score indicates the model's decisions agree with expert judgment beyond what chance alone would produce. Comparing model-human κ against human-human κ on the same items helps answer: 'Does the AI agree with experts about as well as experts agree with each other?'
Assessing Survey & Observational Research
Researchers in social sciences, marketing, and usability studies use κ to ensure coding consistency when categorizing open-ended survey responses, behavioral observations, or interview transcripts.
- For example, in a study coding customer service calls for emotion, multiple researchers would code a sample. Cohen's Kappa objectively measures if their application of the codebook (e.g., 'angry', 'satisfied') is consistent.
- This establishes inter-coder reliability, a prerequisite for publishing findings, as it confirms the data analysis is not biased by individual coder interpretation.
Monitoring Drift in Human-in-the-Loop Systems
In continuous learning systems or human-in-the-loop AI, κ can monitor for concept drift in human judgment over time.
- By periodically measuring agreement (κ) between an AI agent's classifications and a human reviewer's audits, a drop in κ may indicate the AI's logic is drifting or that human labeling standards have shifted.
- This provides an operational metric for model performance monitoring that is more sensitive to semantic alignment than pure accuracy, triggering retraining or guideline review.
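One way to operationalize this is to compute κ over consecutive windows of audited items and flag windows that fall below a threshold. The sketch below is illustrative only: the function names, the window size, and the 0.6 threshold (borrowed from the 'substantial' band) are assumptions, not a prescribed monitoring policy.

```python
from collections import Counter

def kappa(x, y):
    """Cohen's kappa; returns NaN when chance agreement is 1 (kappa undefined)."""
    n = len(x)
    p_o = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    p_e = sum(cx[k] * cy[k] for k in cx) / (n * n)
    return float("nan") if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def windows_below_threshold(agent, human, window, threshold=0.6):
    """Return (window_start, kappa) for each non-overlapping window with kappa < threshold.
    Windows where kappa is undefined (NaN) are not flagged, since NaN < threshold is False."""
    flagged = []
    for start in range(0, len(agent) - window + 1, window):
        k = kappa(agent[start:start + window], human[start:start + window])
        if k < threshold:
            flagged.append((start, k))
    return flagged

# Hypothetical audit log: the agent matches the reviewer for 10 items,
# then starts answering 'ok' regardless of the reviewer's label.
human = ["ok", "err"] * 10
agent = human[:10] + ["ok"] * 10
print(windows_below_threshold(agent, human, window=10))  # [(10, 0.0)]
```

In practice the window must be large enough for κ to be stable, and a flagged window calls for inspection of both the agent's outputs and the reviewers' guidelines.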
Comparing Multiple Raters (Fleiss' Kappa)
While Cohen's Kappa is for two raters, its conceptual extension—Fleiss' Kappa—applies the same chance-correction principle to scenarios with three or more raters.
- This is essential in consensus-driven processes like peer review grading, panel-based diagnostic decisions, or aggregating labels from crowd-sourced platforms.
- Fleiss' Kappa provides a single statistic representing the overall reliability of the rating process across the entire group, ensuring the final aggregated labels are robust.
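A minimal sketch of Fleiss' Kappa follows, assuming every item is rated by the same number of raters and the input is a table of per-item category counts. The function name and the sample ratings are hypothetical.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of ratings in each category.
    n_cats = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: 4 items, 3 raters, 3 categories.
ratings = [
    [3, 0, 0],  # all three raters chose category 0
    [0, 3, 0],
    [1, 2, 0],  # split decision
    [0, 1, 2],
]
print(fleiss_kappa(ratings))
```

Unlike Cohen's Kappa, this formulation does not track which rater gave which label, so it suits crowd-sourced settings where the set of raters differs from item to item.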
Frequently Asked Questions
Cohen's Kappa is a fundamental metric for evaluating the reliability of categorical classifications, especially in the context of error detection and model validation. These questions address its calculation, interpretation, and application in machine learning systems.
Cohen's Kappa (κ) is a statistical measure of inter-rater agreement for categorical items that corrects for the level of agreement expected by chance. It is calculated using the formula: κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement (the accuracy) and p_e is the hypothetical probability of chance agreement, derived from the marginal totals of the confusion matrix.
For example, if two raters classify 100 items into 'Error' or 'No Error' and their observed agreement is 90% (p_o = 0.90), but the expected chance agreement based on their label distributions is 70% (p_e = 0.70), the Kappa score is (0.90 - 0.70) / (1 - 0.70) = 0.667. This indicates the agreement is substantially better than chance.
Related Terms
Cohen's Kappa is a foundational metric for evaluating classifier agreement. These related concepts provide the statistical and practical context for its use in machine learning evaluation and error analysis.
Confusion Matrix
A confusion matrix is the essential table used to calculate Cohen's Kappa and other classification metrics. It provides the raw counts of true positives, false positives, true negatives, and false negatives from which agreement is measured.
- Foundation for Kappa: The observed agreement (p_o) is the sum of the diagonal cells divided by the total count, and the chance agreement (p_e) is derived from the marginal totals of the confusion matrix.
- Error Analysis: Beyond Kappa, the matrix allows for granular diagnosis of specific error types made by a model or rater.
Inter-Rater Reliability (IRR)
Inter-Rater Reliability is the broader field of assessing the degree of agreement among two or more independent raters. Cohen's Kappa is a specific, widely-used statistic within this domain.
- Corrects for Chance: Kappa's key differentiator is that it quantifies agreement beyond what is expected by random chance, unlike simple percent agreement.
- Other IRR Metrics: Includes Fleiss' Kappa (for multiple raters), Krippendorff's Alpha (for various data types and missing data), and Intraclass Correlation Coefficient (ICC) (for continuous data).
F1 Score
The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances a classifier's ability to correctly identify positive cases while minimizing false alarms.
- Complementary to Kappa: While Kappa measures overall agreement with a ground truth across all classes, corrected for chance, the F1 score focuses specifically on performance for the positive class.
- Use Case: F1 is preferred when the cost of false positives and false negatives is high and the class distribution is imbalanced. Kappa provides a more general measure of overall classification accuracy adjusted for chance.
Bland-Altman Plot
A Bland-Altman plot is a graphical method for assessing agreement between two different measurement techniques that measure the same continuous variable, contrasting with Kappa's focus on categorical agreement.
- Visualizing Differences: Plots the differences between two measurements against their averages, with limits of agreement (mean difference ± 1.96 SD).
- Context: Used in medical and engineering fields to evaluate a new measurement method against a gold standard. For categorical data, the confusion matrix and Kappa serve an analogous purpose.
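The quantities behind a Bland-Altman plot can be computed directly. The sketch below assumes paired readings from two methods on the same subjects and uses the sample standard deviation of the differences. The function name and the readings are hypothetical.

```python
from statistics import mean, stdev

def limits_of_agreement(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # mean difference between the two methods
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired readings from a new device and a reference device.
new_device = [10.0, 12.0, 14.0, 16.0]
reference = [9.0, 12.0, 13.0, 16.0]

bias, lower, upper = limits_of_agreement(new_device, reference)
# Plot coordinates: x = average of the pair, y = difference of the pair.
points = [((a + b) / 2, a - b) for a, b in zip(new_device, reference)]
print(bias, lower, upper)
```

The ± 1.96 SD limits assume the differences are roughly normally distributed.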
Calibration Error
Calibration Error measures the discrepancy between a probabilistic classifier's predicted confidence scores and the true empirical frequencies of outcomes. It assesses whether a model's stated "80% confidence" is correct 80% of the time.
- Reliability of Confidence: While Kappa measures what the classifier predicted (the label), calibration assesses how sure it was about that prediction.
- Critical for Trust: A well-calibrated model with moderate Kappa may be more trustworthy and useful for decision-making than a highly accurate but poorly calibrated one, as its confidence scores are actionable.
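One common way to summarize calibration error is the expected calibration error (ECE), which bins predictions by confidence and averages the gap between confidence and accuracy in each bin. The sketch below assumes equal-width bins and a binary 'was the prediction correct' flag per item. The function name and the bin count are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins; `correct` holds 1/0 per prediction."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Stated 80% confidence, correct 4 times out of 5: well calibrated.
well_calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# Stated 90% confidence, correct 1 time out of 4: overconfident.
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
print(well_calibrated, overconfident)
```

The first case yields an ECE of approximately zero and the second approximately 0.65, matching the "80% confidence is correct 80% of the time" reading of calibration.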
Population Stability Index (PSI)
The Population Stability Index is a metric used to monitor the shift or drift in the distribution of a variable (e.g., a model's score or an input feature) between two populations or time periods.
- Monitoring Context: In production ML systems, PSI is used for drift detection. A significant change in input (feature drift) or output (prediction drift) distribution can degrade model performance, which would subsequently be reflected in a dropping Kappa score against new ground truth.
- Proactive vs. Reactive: PSI acts as an early warning signal for potential degradation, while Kappa is a reactive measure of actual performance on evaluated data.
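PSI is commonly computed over binned counts as the sum of (actual share - expected share) × ln(actual share / expected share). The sketch below assumes both populations have already been binned with identical bin edges. The function name and the small `eps` floor for empty bins are illustrative choices.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two binned distributions that share the same bin edges."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # floor avoids log(0) and division by zero
        a_pct = max(a / a_total, eps)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

baseline = [50, 50]  # hypothetical score distribution at training time
current = [80, 20]   # hypothetical score distribution in production

print(population_stability_index(baseline, baseline))  # 0.0
print(population_stability_index(baseline, current))
```

Every term in the sum is non-negative, so PSI is zero only when the two binned distributions match and grows as they diverge.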

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.