Cohen's Kappa (κ) is a statistic that measures inter-rater agreement for categorical items, explicitly adjusting for the level of agreement that would occur purely by random chance. Unlike simple percent agreement, it provides a more reliable assessment of rater reliability, making it a cornerstone metric in fields like medical diagnosis, content moderation, and classifier evaluation against a human-labeled ground truth. It is calculated from the observed agreement and expected agreement in a confusion matrix.
Cohen's Kappa

What is Cohen's Kappa?
Cohen's Kappa is a robust statistical measure for assessing the agreement between two raters on a categorical scale, correcting for the agreement expected by chance alone.
The resulting kappa coefficient ranges from -1 to 1, where 1 indicates perfect agreement, 0 indicates agreement equal to chance, and negative values indicate agreement worse than chance. It is particularly valuable for evaluating annotator consistency in dataset creation and for benchmarking machine learning classifier performance, because it reports only the agreement that exceeds what chance alone would produce. A common rule of thumb treats κ > 0.8 as almost perfect agreement and κ ≤ 0.4 as fair agreement at best.
Interpreting Cohen's Kappa Values
Cohen's Kappa (κ) quantifies the level of agreement between two raters, correcting for the agreement expected by chance. Its value ranges from -1 to 1, with specific thresholds providing standardized interpretations of reliability.
The Core Interpretation Scale
Cohen's Kappa values are interpreted using a standardized scale that categorizes the strength of agreement beyond chance.
- κ ≤ 0: Indicates no agreement or agreement worse than random chance.
- 0.01 – 0.20: Slight agreement. Agreement is minimal and largely attributable to chance.
- 0.21 – 0.40: Fair agreement. There is a noticeable but weak level of agreement.
- 0.41 – 0.60: Moderate agreement. A middling level of agreement is present, clearly above chance but short of reliable.
- 0.61 – 0.80: Substantial agreement. Strong and reliable agreement between raters.
- 0.81 – 1.00: Almost perfect agreement. Near-complete consensus, with minimal disagreement.
This scale, originally proposed by Landis & Koch (1977), provides a common framework for benchmarking classifier performance against a human or expert baseline.
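The scale above can be applied mechanically. The following is a minimal sketch, not a standard library function: the name `interpret_kappa` is hypothetical, and it assumes that values falling in the small gaps between the listed bands (for example 0.205) belong to the higher band.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to the agreement label from the scale above."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa <= 0.0:
        return "no agreement"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


print(interpret_kappa(0.45))  # moderate
```

The labels are a reporting convenience; the thresholds themselves are conventions rather than statistical cut-offs.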
Chance Correction: The Key Differentiator
Cohen's Kappa's primary value over simple percent agreement is its correction for chance agreement. The formula is:
κ = (Po - Pe) / (1 - Pe)
Where:
- Po is the observed proportion of agreement (like simple accuracy).
- Pe is the expected proportion of agreement due to chance, calculated from the marginal totals of the confusion matrix.
Example: If two raters classifying 'Cat' vs. 'Dog' each labeled almost everything 'Cat' because of class imbalance, percent agreement (Po) could be high (e.g., 90%). However, Pe would also be very high: if each rater labels 95% of items 'Cat', Pe = 0.95 × 0.95 + 0.05 × 0.05 = 0.905. Kappa subtracts this expected chance agreement from the observed agreement in the numerator and from the maximum possible agreement (1) in the denominator, yielding κ = (0.90 - 0.905) / (1 - 0.905) ≈ -0.05, which correctly reflects the lack of meaningful consensus. This makes it crucial for imbalanced datasets where a naive classifier can achieve high accuracy by always predicting the majority class.
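To see how strongly the chance term depends on class balance, the sketch below holds the observed agreement at 90% and varies how often each rater says 'Cat'. The function name `cohen_kappa`, the fixed Po, and the three label shares are illustrative assumptions, and both raters are assumed to use 'Cat' at the same rate.

```python
def cohen_kappa(p_o: float, p_e: float) -> float:
    """Chance-corrected agreement: (Po - Pe) / (1 - Pe)."""
    if p_e >= 1.0:
        raise ValueError("kappa is undefined when Pe equals 1")
    return (p_o - p_e) / (1.0 - p_e)


p_o = 0.90  # observed agreement, held fixed (hypothetical)

# Assumed share of items that EACH rater labels 'Cat'.
results = {}
for share_cat in (0.50, 0.80, 0.95):
    # Chance agreement: both say 'Cat' by chance, or both say 'Dog' by chance.
    p_e = share_cat**2 + (1 - share_cat) ** 2
    results[share_cat] = cohen_kappa(p_o, p_e)
    print(f"share 'Cat' = {share_cat:.2f}  Pe = {p_e:.3f}  kappa = {results[share_cat]:.3f}")
```

With balanced labels the same 90% agreement corresponds to κ = 0.80; at an 80/20 split it drops to roughly 0.69, and at 95/5 it falls slightly below zero.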
Contextual Interpretation & Limitations
While the standard scale is a useful heuristic, interpreting Kappa requires context.
- Prevalence Effect: Kappa values can be paradoxically low even with high observed agreement if there is a severe class imbalance (high prevalence of one category). Alternative metrics like Prevalence-Adjusted Bias-Adjusted Kappa (PABAK) may be considered.
- Bias Index: If raters have systematically different tendencies (e.g., Rater A is consistently stricter than Rater B), it affects the marginal totals and thus Pe, influencing κ.
- Ordinal Categories: Standard (unweighted) Kappa treats categories as nominal, so every disagreement counts equally. For ordinal data, Weighted Kappa should be used, which assigns partial credit for near-misses (e.g., rating '3' vs. '4' on a 5-point scale is a smaller disagreement than '1' vs. '5').
Kappa is a measure of reliability, not validity. High agreement does not guarantee the raters are correct, only that they are consistent.
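The prevalence effect and the bias index described above can be checked numerically for a 2x2 table. The sketch below is illustrative only: `agreement_diagnostics` is a hypothetical helper, the counts are made up, and it uses the usual two-category definitions (prevalence index |a - d| / N, bias index |b - c| / N, PABAK = 2·Po - 1).

```python
def agreement_diagnostics(a: int, b: int, c: int, d: int) -> dict:
    """Kappa plus prevalence/bias diagnostics for a 2x2 agreement table.

    a and d are the agreeing cells (both 'Yes', both 'No');
    b and c are the two kinds of disagreement.
    Assumes the raters do not both use a single label (Pe < 1).
    """
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return {
        "po": p_o,
        "kappa": (p_o - p_e) / (1 - p_e),
        "prevalence_index": abs(a - d) / n,  # imbalance between the agreeing cells
        "bias_index": abs(b - c) / n,        # imbalance between the disagreeing cells
        "pabak": 2 * p_o - 1,                # kappa with Pe fixed at 0.5
    }


# Two hypothetical tables with the same 90% raw agreement.
skewed = agreement_diagnostics(a=88, b=5, c=5, d=2)      # 'Yes' dominates
balanced = agreement_diagnostics(a=45, b=5, c=5, d=45)   # classes evenly split

for name, result in (("skewed", skewed), ("balanced", balanced)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Both tables have Po = 0.90 and PABAK = 0.80, but κ is about 0.23 for the skewed table and 0.80 for the balanced one, which is the prevalence effect in miniature.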
Application in AI Evaluation
In machine learning, Cohen's Kappa is a cornerstone metric for evaluation-driven development, particularly when assessing a model against a 'gold standard' human annotation.
Primary Use Case: Evaluating a classifier's performance on tasks where human judgment is the benchmark, such as:
- Sentiment analysis (Positive/Neutral/Negative)
- Medical image diagnosis (Disease Present/Absent)
- Content moderation (Safe/Unsafe)
- Named Entity Recognition (correct span and type classification)
Benchmarking: A κ score above 0.8 (Almost Perfect) against expert annotators is often a target for production-grade models, indicating the AI's decisions are highly aligned with human expertise. It is commonly reported alongside precision, recall, and F1 score to provide a complete picture of classifier performance that accounts for chance.
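For a model-versus-human comparison, Kappa can be computed directly from the two label sequences. This is a minimal pure-Python sketch; `kappa_from_labels` and the sentiment labels are hypothetical, and it works for any number of categories.

```python
from collections import Counter
from typing import Hashable, Sequence


def kappa_from_labels(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)

    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical sentiment labels from a model and a human annotator.
model = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neg"]
human = ["pos", "neg", "neg", "neu", "neg", "pos", "pos", "neg", "pos", "neu"]
print(round(kappa_from_labels(model, human), 3))  # 0.531
```

Here the two sequences agree on 7 of 10 items (Po = 0.70) with Pe = 0.36, giving κ ≈ 0.53. In practice, scikit-learn's `cohen_kappa_score` in `sklearn.metrics` computes the same statistic.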
Comparison with Related Metrics
Understanding when to use Kappa versus other agreement or performance metrics is critical.
- vs. Percent Agreement: Percent agreement ignores chance. Kappa never exceeds percent agreement, which makes it the more conservative figure and the more defensible one for scientific reporting.
- vs. Accuracy: Accuracy equals Po (observed agreement) when the model's predictions and the reference labels are treated as two raters. Kappa provides the chance-corrected version, making it more informative for imbalanced classes.
- vs. F1 Score: The F1 score balances precision and recall for a single model's output against ground truth. Kappa measures agreement between two raters (model vs. human), incorporating the idea that the 'ground truth' itself may have inherent subjectivity.
- vs. Intraclass Correlation (ICC): ICC is used for continuous or ordinal data from multiple raters and assesses agreement based on variance components. Choose Kappa for categorical data and ICC for measuring reliability of continuous scores.
Calculating Kappa from a Confusion Matrix
The calculation is straightforward from a 2x2 or larger confusion matrix. For two raters (Model and Human) with two categories:
| | Human: Yes | Human: No | Total |
|---|---|---|---|
| Model: Yes | a (True Pos) | b (False Pos) | a+b |
| Model: No | c (False Neg) | d (True Neg) | c+d |
| Total | a+c | b+d | N |
- Observed Agreement (Po): (a + d) / N
- Chance Agreement (Pe): Calculate the probability both would randomly say 'Yes' plus the probability both would randomly say 'No'.
- Pe = [ ((a+b)/N) * ((a+c)/N) ] + [ ((c+d)/N) * ((b+d)/N) ]
- Apply Formula: κ = (Po - Pe) / (1 - Pe)
Example: If a=45, b=15, c=10, d=30 (N=100), then Po=0.75 and Pe = (0.60 × 0.55) + (0.40 × 0.45) = 0.51, yielding κ = 0.24 / 0.49 ≈ 0.49 (Moderate agreement). This demonstrates how Kappa discounts the agreement that occurred simply because both raters frequently said 'Yes' or 'No'.
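The same steps generalize from a 2x2 table to any square table. The sketch below is one possible implementation under that assumption; `kappa_from_matrix` is a hypothetical name, and the input is the worked example's counts with Model in rows and Human in columns.

```python
from typing import Sequence


def kappa_from_matrix(matrix: Sequence[Sequence[int]]) -> float:
    """Cohen's kappa from a square agreement table (rows: rater 1, columns: rater 2)."""
    k = len(matrix)
    if any(len(row) != k for row in matrix):
        raise ValueError("matrix must be square")
    n = sum(sum(row) for row in matrix)

    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    # Observed agreement: share of counts on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: sum over categories of (row share * column share).
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (p_o - p_e) / (1 - p_e)


# Counts from the worked example: rows are Model (Yes, No), columns are Human (Yes, No).
example = [[45, 15],
           [10, 30]]
print(f"kappa = {kappa_from_matrix(example):.2f}")
```

Because the marginal products are summed over every category, the function needs no changes for three or more classes.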
Cohen's Kappa vs. Accuracy: Key Differences
A comparison of two core classification metrics, highlighting when to use the chance-corrected Cohen's Kappa versus the simpler Accuracy score.
| Metric / Feature | Cohen's Kappa (κ) | Accuracy |
|---|---|---|
| Core Definition | Measures inter-rater agreement for categorical items, correcting for agreement expected by chance. | Measures the proportion of total correct predictions (both true positives and true negatives) out of all predictions. |
| Mathematical Correction | Explicitly accounts for and subtracts the probability of chance agreement between the classifier and the ground truth. | No correction for chance agreement; treats all correct predictions equally. |
| Interpretation Range | -1 to 1, where ≤ 0 indicates no agreement beyond chance, 0.01–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.0 almost perfect. | 0 to 1 (or 0% to 100%), where 1.0 indicates all predictions are correct. |
| Handling of Class Imbalance | Robust. Remains informative even when one class dominates the dataset, as chance agreement is factored into the calculation. | Misleading. Can be artificially high on imbalanced datasets (e.g., 95% accuracy if the model always predicts the majority class). |
| Primary Use Case | Assessing classifier performance against a human or expert baseline (inter-rater reliability). Evaluating agreement in categorical labeling tasks. | Providing an intuitive, initial performance check on balanced datasets where all error types are considered equally costly. |
| Information from Confusion Matrix | Utilizes all cells (TP, TN, FP, FN) to calculate observed and chance agreement. | Utilizes the diagonal (TP, TN) and the total sum of all cells. |
| Sensitivity to Error Cost | Implicitly weights errors based on the prevalence of classes through the chance agreement term. | Assumes all errors (false positives and false negatives) have equal cost. |
| Recommended For | Imbalanced datasets, expert validation studies, medical diagnostics, content moderation systems, and any task where a human baseline exists. | Preliminary analysis on balanced datasets, educational contexts, and when a simple, interpretable metric is required for stakeholders. |
Frequently Asked Questions
Cohen's Kappa is a core statistic for evaluating categorical agreement, particularly in classifier assessment and human annotation tasks. These questions address its calculation, interpretation, and practical application in machine learning workflows.
Cohen's Kappa (κ) is a statistic that measures the level of agreement between two raters for categorical items, correcting for the agreement expected by chance. It works by comparing the observed proportion of agreement (Po) with the probability of agreement due to random chance (Pe), using the formula κ = (Po - Pe) / (1 - Pe). A value of 1 indicates perfect agreement, 0 indicates agreement equal to chance, and negative values indicate agreement worse than chance. This correction for chance is its primary advantage over simple percent agreement, making it a robust metric for tasks like assessing classifier performance against a human-labeled ground truth or measuring inter-annotator agreement in data labeling pipelines.
Related Terms
Cohen's Kappa is a cornerstone metric for evaluating agreement, but it exists within a broader ecosystem of statistical measures. These related terms provide the context for when and why to use Kappa versus other metrics.
Inter-Rater Reliability
Inter-rater reliability is the broader concept of measuring the degree of agreement among two or more independent raters or observers. It assesses the consistency of human judgment, which is often the 'ground truth' for training and evaluating classifiers.
- Purpose: To quantify the inherent subjectivity in a labeling task before trusting a model's outputs.
- Key Methods: Includes Cohen's Kappa, Fleiss' Kappa (for >2 raters), and Intraclass Correlation Coefficient (for continuous data).
- Example: If human annotators only agree 70% of the time on a sentiment labeling task, a model achieving 75% accuracy may be performing near the practical ceiling.
Confusion Matrix
A confusion matrix is the fundamental tabular layout used to calculate Cohen's Kappa and many other classification metrics. It provides the raw counts of predictions versus actual labels.
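- Structure: By a common convention, rows represent true classes and columns represent predicted classes; the transposed layout is also used, and Kappa is unaffected because it is symmetric in the two raters. Cells contain counts for True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).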
- Relationship to Kappa: The observed agreement (Po) is calculated as (TP + TN) / Total. The expected chance agreement (Pe) is derived from the row and column marginal totals of this matrix.
- Foundation: Metrics like Accuracy, Precision, Recall, and F1 Score are all computed directly from the confusion matrix.
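As a rough illustration of how these metrics share one source, the sketch below derives Accuracy, Precision, Recall, F1, and Kappa from the four cells of a binary confusion matrix. The function `binary_metrics` and the counts are hypothetical, and the code assumes every denominator is non-zero.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Common classification metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total          # identical to Po
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Chance agreement from the marginals:
    # (predicted positive * actual positive) + (predicted negative * actual negative).
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (accuracy - p_e) / (1 - p_e)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical counts for a 200-item test set.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts accuracy is 0.85 while κ is 0.625, since 60% agreement would already be expected from the marginal totals alone.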
Accuracy
Accuracy is the simplest classification metric: the proportion of total correct predictions. Cohen's Kappa is often preferred because it corrects for Accuracy's major flaw.
- Formula: (TP + TN) / Total Predictions.
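- The Chance Problem: On an imbalanced dataset (e.g., 95% negative, 5% positive), a naive 'always negative' classifier would achieve 95% accuracy, misleadingly suggesting high performance. Here the agreement expected by chance (Pe) is also 95%, because a classifier that only ever says 'negative' matches the 95% of items that are negative regardless of the input.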
- Kappa's Adjustment: Cohen's Kappa explicitly subtracts this chance agreement, providing a more realistic performance score. A Kappa of 0 indicates performance no better than chance, regardless of the raw accuracy.
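The 'always negative' case above is easy to reproduce. The following sketch uses a made-up 100-item test set; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical imbalanced test set: 95 negatives, 5 positives.
truth = ["neg"] * 95 + ["pos"] * 5
always_negative = ["neg"] * 100  # naive majority-class classifier

n = len(truth)
accuracy = sum(t == p for t, p in zip(truth, always_negative)) / n

# Chance agreement from the label frequencies of truth and predictions.
truth_freq, pred_freq = Counter(truth), Counter(always_negative)
p_e = sum(truth_freq[label] * pred_freq[label] for label in truth_freq) / n**2

kappa = (accuracy - p_e) / (1 - p_e)
print(accuracy, kappa)  # 0.95 0.0
```

Accuracy and Pe are both 0.95, so the numerator of the Kappa formula is zero and the classifier receives no credit.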
Fleiss' Kappa
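Fleiss' Kappa extends chance-corrected agreement to three or more raters, where each rater classifies each item into mutually exclusive categories. Strictly, it generalizes Scott's pi rather than Cohen's Kappa, so the two statistics can differ slightly even when there are exactly two raters.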
- Use Case: Common in medical research or content moderation where multiple experts label the same data.
- Key Difference: While Cohen's Kappa is calculated from a square rater-by-rater table (2x2 in the binary case) for two raters, Fleiss' Kappa uses a matrix of subjects (rows) by categories (columns), with cell entries indicating how many raters assigned that subject to that category.
- Interpretation: Uses the same scale as Cohen's Kappa (≤0: poor, 0.01-0.20: slight, 0.21-0.40: fair, 0.41-0.60: moderate, 0.61-0.80: substantial, 0.81-1.00: almost perfect).
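The subjects-by-categories layout described above leads to a short computation. This is a sketch of the standard Fleiss formula, assuming every subject is rated by the same number of raters; the function name `fleiss_kappa` and the moderation data are hypothetical.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from a subjects x categories table of rating counts.

    counts[i][j] is the number of raters who put subject i in category j.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]

    # Per-subject agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_subjects   # mean observed agreement
    p_e = sum(p * p for p in p_j)   # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 items, 3 raters each, categories (Safe, Unsafe).
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(ratings), 3))  # 0.333
```

Two of the four items are rated unanimously and two are split 2 to 1, which gives a mean observed agreement of about 0.67 against a chance level of 0.50.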
Weighted Kappa
Weighted Kappa is an extension of Cohen's Kappa used for ordinal categories (e.g., 'Low', 'Medium', 'High'), where some disagreements are more serious than others.
- Core Idea: Not all misclassifications are equal. Predicting 'High' when the truth is 'Medium' is a less severe error than predicting 'High' when the truth is 'Low'.
- Weighting Matrix: Uses a pre-defined matrix (often linear or quadratic weights) to penalize disagreements based on the distance between categories.
- Application: Essential for tasks like severity scoring in medicine, sentiment intensity analysis, or Likert-scale survey analysis, where the ordinal relationship between classes must be preserved in the evaluation.
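A minimal sketch of Weighted Kappa follows, assuming categories are indexed in their natural order and disagreement weights are |i - j| (linear) or (i - j)² (quadratic). The function name `weighted_kappa` and the severity table are hypothetical.

```python
from typing import Sequence


def weighted_kappa(matrix: Sequence[Sequence[int]], scheme: str = "linear") -> float:
    """Weighted kappa from a square table over ordered categories."""
    if scheme not in ("linear", "quadratic"):
        raise ValueError("scheme must be 'linear' or 'quadratic'")
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    power = 1 if scheme == "linear" else 2
    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) ** power  # 0 on the diagonal, larger further away
            observed += weight * matrix[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / n**2
    return 1.0 - observed / expected


# Hypothetical severity ratings: rows and columns ordered Low, Medium, High.
table = [[20, 5, 0],
         [4, 15, 6],
         [1, 4, 20]]
print(round(weighted_kappa(table, "linear"), 3))     # 0.687
print(round(weighted_kappa(table, "quadratic"), 3))  # 0.772
```

Unweighted Kappa for this table is 0.60. The weighted values are higher because most disagreements are between adjacent categories, which both schemes penalize lightly. With only two categories the linear scheme reduces to ordinary Cohen's Kappa.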
Intraclass Correlation Coefficient (ICC)
The Intraclass Correlation Coefficient is a reliability statistic used for continuous or ordinal data measured by multiple raters. It is a key alternative to Kappa for non-categorical data.
- Data Type: Used when raters provide scores on a continuous scale (e.g., rating image quality from 1-100) or ordered ratings.
- Model Variants: Different ICC forms (ICC(1), ICC(2,k), etc.) account for whether raters are a random sample or fixed, and whether absolute agreement or consistency is measured.
- Comparison to Kappa: While Kappa is for nominal (unordered) categories, ICC is designed for measurements where the magnitude of difference between ratings is meaningful. It assesses both correlation and agreement in the units of measurement.
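For comparison, the simplest ICC form can be sketched from a one-way ANOVA decomposition. The code below assumes the one-way random-effects, single-rater variant, ICC(1,1) = (MSB - MSW) / (MSB + (k - 1)·MSW); the function name `icc_one_way` and the scores are hypothetical, and other ICC forms need a different decomposition.

```python
from typing import Sequence


def icc_one_way(scores: Sequence[Sequence[float]]) -> float:
    """ICC(1,1): one-way random effects, single-rater reliability.

    scores[i][j] is the score given to subject i by its j-th rater.
    """
    n = len(scores)     # number of subjects
    k = len(scores[0])  # ratings per subject
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 subjects, each with the same number (>= 2) of ratings")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]

    # Between-subject and within-subject mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(scores, subject_means)
                    for x in row) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical image-quality scores (1-100) from three raters for four images.
quality = [[82, 85, 80], [40, 45, 43], [65, 60, 66], [90, 92, 95]]
print(round(icc_one_way(quality), 3))
```

Unlike Kappa, the result depends on how far apart the scores are, not just on whether they match exactly.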

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.