Fleiss' Kappa is a statistical measure used to assess the inter-rater reliability of agreement among a fixed number of raters, annotators, or models when classifying items into categorical scales. It extends Cohen's Kappa to more than two raters and corrects for the level of agreement expected purely by chance. In agentic cognitive architectures, it quantifies the consistency of multiple reasoning paths or model outputs, providing a foundation for self-consistency mechanisms like ensemble averaging or majority voting.
Fleiss' Kappa

What is Fleiss' Kappa?
Fleiss' Kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters or models when assigning categorical ratings.
The calculation compares the observed proportion of agreement to the proportion expected by chance, producing a value of at most 1. A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement below chance level, which can signal systematic disagreement. The lower bound depends on the number of raters and the category distribution; it reaches -1 only in special cases, such as two raters who always disagree on a balanced binary task. Fleiss' Kappa is a useful tool for evaluation-driven development, for example to check the consistency of automated planning systems or of hierarchical task network outputs before deploying agents to production.
Key Characteristics of Fleiss' Kappa
Fleiss' Kappa is a key metric for evaluating the consistency of outputs from multiple AI agents or annotators in a system. Its main characteristics are outlined below.
Generalization of Cohen's Kappa
Fleiss' Kappa extends Cohen's Kappa, which measures agreement between two raters, to scenarios involving three or more raters. This makes it essential for evaluating ensembles, multi-agent systems, or crowdsourced labeling tasks where more than two sources provide judgments.
- Key Formula: κ = (P̄ - P̄_e) / (1 - P̄_e)
- P̄: The observed agreement, computed as the proportion of rater pairs that agree on an item, averaged over all items.
- P̄_e: The expected proportion of agreement due to chance, calculated from the marginal distributions of categories across all raters.
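The two terms in the formula can be written out explicitly. The following is the standard formulation; the notation is introduced here for illustration and does not appear in the text above: N items, n ratings per item, k categories, and n_{ij} for the number of raters who assigned item i to category j.

```latex
\begin{align}
p_j &= \frac{1}{N n} \sum_{i=1}^{N} n_{ij}
  && \text{(overall share of ratings in category } j\text{)} \\
P_i &= \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}\,(n_{ij} - 1)
  && \text{(share of agreeing rater pairs on item } i\text{)} \\
\bar{P} &= \frac{1}{N} \sum_{i=1}^{N} P_i \\
\bar{P}_e &= \sum_{j=1}^{k} p_j^{2} \\
\kappa &= \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
\end{align}
```

Note that κ is undefined when P̄_e = 1, which happens when every rating falls into a single category.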
Chance-Corrected Agreement
The core value of Fleiss' Kappa is its correction for agreement expected by random chance. A simple percentage agreement can be misleadingly high if categories are imbalanced. Fleiss' Kappa provides a more rigorous score where:
- κ = 1: Perfect agreement beyond chance.
- κ = 0: Agreement equal to what is expected by chance.
- κ < 0: Agreement worse than chance (systematic disagreement).
This correction is vital for assessing the genuine consensus in self-consistency mechanisms like ensemble voting or label aggregation.
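The effect of class imbalance can be illustrated with a small sketch. The data below is hypothetical (three raters, two categories, ten items), and the helper name `agreement_stats` is an assumption made for this example, not part of any library.

```python
def agreement_stats(counts):
    """Return (observed, expected, kappa) for a table of per-item counts.

    counts[i][j] is the number of raters who put item i into category j.
    Every row is assumed to sum to the same number of ratings (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    observed = sum(
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    expected = sum((sum(col) / total) ** 2 for col in zip(*counts))
    return observed, expected, (observed - expected) / (1 - expected)


# Hypothetical, heavily imbalanced data: columns are ("ok", "error").
# Eight items are unanimous "ok"; on two items one rater says "error".
counts = [[3, 0]] * 8 + [[2, 1]] * 2
observed, expected, kappa = agreement_stats(counts)
print(f"observed={observed:.3f} expected={expected:.3f} kappa={kappa:.3f}")
# prints: observed=0.867 expected=0.876 kappa=-0.071
```

Raw agreement of about 87% looks strong, yet the score is slightly negative because almost all of that agreement is what the skewed category distribution would produce by chance.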
Fixed Number of Raters
Fleiss' Kappa assumes that every item receives the same number of ratings. The raters do not have to be the same individuals for every item, but the number of ratings per item must be constant. This structure is common in AI evaluation setups, such as:
- Having a panel of 5 expert models score 100 agent responses.
- Using 3 different truth inference algorithms to label a dataset.
The standard formulation does not accommodate items that receive different numbers of ratings, for example when some raters skip items. Such missing data calls for alternative measures like Krippendorff's Alpha.
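The constant-ratings requirement can be checked while tallying raw labels into the count table that the statistic is computed from. This is a minimal sketch; the function name `build_count_table` and the sample labels are hypothetical.

```python
from collections import Counter


def build_count_table(ratings_per_item, categories):
    """Tally raw labels into one row of category counts per item.

    Raises ValueError if items have different numbers of ratings or if a
    label is not in `categories`.
    """
    if not ratings_per_item:
        raise ValueError("no items to tally")
    n_ratings = len(ratings_per_item[0])
    known = set(categories)
    table = []
    for index, labels in enumerate(ratings_per_item):
        if len(labels) != n_ratings:
            raise ValueError(
                f"item {index} has {len(labels)} ratings, expected {n_ratings}"
            )
        tally = Counter(labels)
        unknown = set(tally) - known
        if unknown:
            raise ValueError(f"item {index} has unknown labels: {sorted(unknown)}")
        # Counter returns 0 for categories nobody chose.
        table.append([tally[category] for category in categories])
    return table


# Hypothetical sentiment labels from three raters per item.
categories = ["positive", "neutral", "negative"]
ratings = [
    ["positive", "positive", "positive"],
    ["positive", "neutral", "neutral"],
    ["negative", "negative", "neutral"],
]
table = build_count_table(ratings, categories)
print(table)
```

Failing fast on ragged input is a design choice for this sketch; a pipeline with genuinely missing ratings would instead switch to a statistic that tolerates them.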
Categorical Data Requirement
Fleiss' Kappa is designed for nominal (categorical) data where ratings fall into distinct, unordered classes. Examples in AI systems include:
- Intent classification (e.g., 'purchase', 'query', 'complaint').
- Sentiment labels (e.g., 'positive', 'neutral', 'negative').
- Error type categorization in model outputs.
It is not suitable for ordinal (ranked) or continuous data without modification. For ordinal ratings, weighted Kappa variants that penalize disagreements by degree of distance are used.
Interpretation Benchmarks
While interpretation depends on context, established benchmarks from Landis & Koch (1977) are commonly cited:
- < 0.00: Poor agreement.
- 0.00 – 0.20: Slight agreement.
- 0.21 – 0.40: Fair agreement.
- 0.41 – 0.60: Moderate agreement.
- 0.61 – 0.80: Substantial agreement.
- 0.81 – 1.00: Almost perfect agreement.
In agentic systems, a high Fleiss' Kappa across multiple agent instances indicates consistent, reproducible behavior, which is critical for production-grade deployment.
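The benchmark bands above translate directly into a lookup. This is a small sketch; the function name `landis_koch_label` is hypothetical, and values that fall between two printed bands (for example 0.205) are assigned to the higher band as a simplifying assumption.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch (1977) descriptive band."""
    # Also rejects NaN, because every comparison with NaN is False.
    if not -1 <= kappa <= 1:
        raise ValueError(f"kappa must lie in [-1, 1], got {kappa!r}")
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label


for value in (-0.07, 0.15, 0.45, 0.72, 0.93):
    print(value, landis_koch_label(value))
```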
Relation to Other Consensus Metrics
Fleiss' Kappa is part of a family of agreement and consensus mechanisms. Key distinctions:
- vs. Majority Voting: Fleiss' Kappa measures the reliability of the agreement that voting relies upon. Low Kappa suggests voting may be unstable.
- vs. Truth Inference: Fleiss' Kappa can be used to assess the consistency of the noisy sources before applying a truth inference algorithm like Dawid-Skene.
- vs. Intraclass Correlation Coefficient (ICC): ICC is used for continuous or ordinal data, while Fleiss' Kappa is for categorical data.
It is a foundational diagnostic tool before implementing more complex aggregation techniques like weighted consensus or Bayesian Model Averaging.
How Fleiss' Kappa is Calculated
Fleiss' Kappa (κ) is calculated from a table of counts built from the categorical ratings, with one row per item and one column per category. Each cell holds the number of raters who assigned that item to that category. The observed agreement is the proportion of agreeing rater pairs, averaged over items. The chance agreement is estimated from the overall category proportions, that is, from the column totals of the table.
The resulting statistic is at most 1 and is commonly interpreted as follows: κ < 0 indicates poor agreement (worse than chance), 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. It is widely used to evaluate inter-rater reliability in fields like medical diagnosis and to measure the consistency of model outputs in agent ensembles.
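The calculation described above can be sketched in a few lines of standard-library Python. The function name `fleiss_kappa`, the input layout (a list of per-item count rows), and the choice to return NaN for the degenerate case are assumptions of this example, and the sample data is hypothetical.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = raters assigning item i to category j."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: agreeing rater pairs per item, averaged over items.
    n_pairs = n_raters * (n_raters - 1)
    p_bar = sum(sum(c * (c - 1) for c in row) / n_pairs for row in counts) / n_items

    # Chance agreement: squared overall category proportions (column totals).
    total = n_items * n_raters
    p_e = sum((sum(column) / total) ** 2 for column in zip(*counts))

    if math.isclose(p_e, 1.0):
        # All ratings fall into one category, so kappa is undefined (0 / 0).
        return math.nan
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical table: 4 items, 3 raters, 3 categories.
counts = [
    [3, 0, 0],
    [0, 3, 0],
    [1, 2, 0],
    [0, 1, 2],
]
kappa = fleiss_kappa(counts)
print(f"{kappa:.3f}")  # prints: 0.455
```

Here the observed agreement is 2/3 and the chance agreement is 7/18, so κ = 5/11, which falls in the moderate band.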
Frequently Asked Questions
These questions address the core mechanics, applications, and interpretation of Fleiss' Kappa for engineers building robust agent systems.
Fleiss' Kappa (κ) is a statistical measure for assessing the inter-rater reliability of agreement among a fixed number of raters (or models) when classifying items into mutually exclusive categories. It extends Cohen's Kappa to more than two raters. The calculation works by comparing the observed proportion of agreement among raters to the proportion of agreement expected by chance. The formula is κ = (P̄ - P̄_e) / (1 - P̄_e), where P̄ is the observed agreement and P̄_e is the agreement expected by chance. A value of 1 indicates perfect agreement, 0 indicates agreement equal to chance, and negative values indicate agreement worse than chance. It is a crucial metric for validating the consistency of multiple AI agents or human annotators in labeling tasks, such as categorizing user intents or evaluating model outputs.
Related Terms
Fleiss' Kappa is a key metric for measuring agreement in multi-rater scenarios. The following concepts are fundamental to understanding and implementing robust consensus and aggregation mechanisms in AI systems.
Cohen's Kappa
Cohen's Kappa is a statistical measure of inter-rater agreement for two raters on a categorical scale, correcting for the level of agreement expected by chance. It is the foundational metric from which Fleiss' Kappa generalizes to multiple raters.
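- Key Difference: Cohen's Kappa is defined for exactly two raters and estimates chance agreement from each rater's own label distribution, while Fleiss' Kappa handles any fixed number of ratings per item (two or more) and estimates chance agreement from the pooled label distribution. As a result, the two statistics can differ even when there are only two raters.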
- Calculation: It compares the observed agreement (pₒ) to the expected agreement by chance (pₑ) using the formula: κ = (pₒ - pₑ) / (1 - pₑ).
- Use Case: Commonly used to evaluate the reliability of two human annotators or to compare a single model's output against a gold-standard label.
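The two-rater formula can be sketched as follows. The function name `cohens_kappa` and the sample labels are hypothetical; chance agreement is computed from each rater's own label frequencies.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n_items = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n_items
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(
        freq_a[category] * freq_b[category] for category in set(freq_a) | set(freq_b)
    ) / (n_items * n_items)
    if math.isclose(p_e, 1.0):
        return math.nan  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two annotators on six items.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"{kappa:.3f}")  # prints: 0.333
```

The raters agree on 4 of 6 items (pₒ = 2/3) and both use each label half the time (pₑ = 1/2), giving κ = 1/3.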
Truth Inference
Truth inference is the process of aggregating multiple, potentially noisy labels from different sources (e.g., crowd workers, weak models, or sensors) to estimate a single, reliable 'ground truth' label. It is the practical engineering problem that metrics like Fleiss' Kappa help evaluate.
- Core Challenge: Different sources have varying levels of expertise, bias, and reliability.
- Common Algorithms: Methods include Dawid-Skene, Majority Voting, and more sophisticated probabilistic graphical models that estimate both the true label and each source's confusion matrix.
- Application: Critical for creating high-quality training datasets, evaluating model ensembles, and aggregating outputs in decentralized or multi-agent systems.
Majority Voting
Majority voting (or hard voting) is a fundamental consensus mechanism where the final categorical output is determined by selecting the option chosen by the majority of individual models or agents in an ensemble. It is a simple aggregation method whose reliability can be assessed using Fleiss' Kappa.
- Mechanism: Each model in an ensemble gets one vote; the class with the most votes wins.
- Relation to Kappa: A high Fleiss' Kappa score among ensemble members indicates strong agreement, suggesting that majority voting will produce a stable result. Agreement alone does not guarantee that the result is correct, since the members may share the same error.
- Limitation: Treats all models as equally reliable, which may not be optimal if some models are more accurate than others.
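A minimal sketch of hard voting is shown below. The function name `majority_vote`, the returned tuple, and the alphabetical tie-break are assumptions made for this example.

```python
from collections import Counter


def majority_vote(votes):
    """Return (winning_label, vote_share, is_tie) for a list of labels."""
    if not votes:
        raise ValueError("at least one vote is required")
    tally = Counter(votes)
    top_count = max(tally.values())
    # Sort tied winners so the result does not depend on vote order.
    winners = sorted(label for label, count in tally.items() if count == top_count)
    return winners[0], top_count / len(votes), len(winners) > 1


# Hypothetical intent labels from three ensemble members.
label, share, is_tie = majority_vote(["purchase", "query", "purchase"])
print(label, round(share, 2), is_tie)
```

Returning the vote share and a tie flag alongside the label lets a caller detect weak or ambiguous majorities instead of silently accepting them.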
Ensemble Averaging
Ensemble averaging is a self-consistency technique for regression or probability outputs that combines predictions from multiple models by computing their arithmetic mean. While Fleiss' Kappa measures categorical agreement, ensemble averaging is used to produce a final, more stable and accurate continuous prediction.
- Purpose: Reduces variance and mitigates the impact of outliers from any single model.
- Contrast with Voting: Used for numerical outputs, whereas majority voting is for categorical labels.
- Foundation for Uncertainty: The variance across the ensemble's predictions can be used as a simple measure of predictive uncertainty.
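As a small illustration, the mean and spread of an ensemble's numeric predictions can be computed with the standard library. The function name `ensemble_average` and the sample probabilities are hypothetical, and the population variance is used here as one simple choice of spread measure.

```python
from statistics import fmean, pvariance


def ensemble_average(predictions):
    """Return (mean, variance) of numeric predictions from several models."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    # The mean is the aggregated prediction; the population variance is a
    # crude indicator of how much the ensemble members disagree.
    return fmean(predictions), pvariance(predictions)


# Hypothetical probabilities for the same input from three models.
mean, variance = ensemble_average([0.70, 0.80, 0.90])
print(f"mean={mean:.4f} variance={variance:.4f}")
# prints: mean=0.8000 variance=0.0067
```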
Inter-Rater Reliability (IRR)
Inter-rater reliability (IRR) is the broader field of study concerned with the degree of agreement among raters. Fleiss' Kappa is one specific statistic within this field, designed for categorical data with multiple raters.
- Other IRR Metrics: Includes Intraclass Correlation Coefficient (ICC) for continuous data, Krippendorff's Alpha (which handles missing data and is applicable to any level of measurement), and Percent Agreement (which does not correct for chance).
- Engineering Significance: High IRR is a prerequisite for creating trustworthy labeled datasets, which are the foundation for supervised learning. Low IRR signals that annotation guidelines are unclear or the task is inherently ambiguous.
Weighted Consensus
Weighted consensus is an aggregation technique where the contributions of individual models or agents are combined based on assigned weights, typically reflecting their estimated confidence, historical accuracy, or reliability. It is a more sophisticated alternative to simple majority voting.
- Mechanism: The final output is a weighted sum or a weighted vote, where higher-performing models have greater influence.
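- Connection to Evaluation: Fleiss' Kappa itself is a single score for a whole set of items, but the per-item agreement that feeds into it (the proportion of agreeing rater pairs on one item) can be inspected directly. Items with low per-item agreement may trigger a fallback to a weighted consensus from a trusted subset of models.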
- Application: Central to mixture of experts architectures and federated learning aggregation protocols like Federated Averaging (FedAvg).
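The fallback idea can be sketched as follows: use a plain majority when per-item agreement is high, and a weighted vote otherwise. All names, the weights, and the 0.5 threshold are hypothetical choices for illustration, not recommended values.

```python
from collections import Counter, defaultdict


def item_agreement(votes):
    """Share of voter pairs that agree on a single item (needs >= 2 votes)."""
    n_votes = len(votes)
    agreeing = sum(count * (count - 1) for count in Counter(votes).values())
    return agreeing / (n_votes * (n_votes - 1))


def weighted_vote(votes, weights):
    """Return the label with the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(votes, weights):
        totals[label] += weight
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(totals), key=totals.__getitem__)


def aggregate(votes, weights, min_agreement=0.5):
    """Majority vote when agreement is high, weighted vote otherwise."""
    if item_agreement(votes) >= min_agreement:
        return Counter(votes).most_common(1)[0][0]
    return weighted_vote(votes, weights)


# Hypothetical reliability weights for five models.
weights = [0.2, 0.2, 0.9, 0.8, 0.5]
clear = ["purchase", "purchase", "purchase", "purchase", "query"]
split = ["purchase", "purchase", "query", "query", "complaint"]
print(aggregate(clear, weights))  # agreement 0.6, majority decides
print(aggregate(split, weights))  # agreement 0.2, weights decide
```

On the split item the two highest-weighted models both chose "query", so the weighted fallback selects it even though no label has a strict majority.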

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.