Cohen's Kappa (κ) is a statistical measure of inter-rater reliability for categorical items. It quantifies the agreement between two annotators, judges, or classification models, while explicitly accounting for the agreement that would occur purely by random chance. The metric produces a value between -1 and 1, where 1 indicates perfect agreement, 0 indicates agreement equal to chance, and negative values indicate agreement worse than chance. It is foundational for evaluating label consistency in datasets used to train or benchmark machine learning models.
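Concretely, the statistic is computed as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of items on which the raters agree and p_e is the agreement expected by chance, derived from each rater's marginal label frequencies. The sketch below shows a minimal from-scratch implementation of this formula; `cohens_kappa` is an illustrative helper written for this example, not a library function.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items.

    Illustrative implementation of kappa = (p_o - p_e) / (1 - p_e).
    """
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement: for each category, the product of the two
    # raters' marginal frequencies, summed over all categories.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Two raters agree on 4 of 6 items (p_o = 2/3); both use each label half
# the time, so p_e = 0.5, giving kappa = (2/3 - 1/2) / (1/2) = 1/3.
kappa = cohens_kappa([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0])
print(round(kappa, 4))  # → 0.3333
```

In practice the same quantity is available as `sklearn.metrics.cohen_kappa_score`; the hand-rolled version above is shown only to make the chance-correction step explicit. Note that the formula is undefined when p_e = 1 (both raters always assign a single identical label), a degenerate case a production implementation would need to handle.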
