Computing inter‐rater reliability and its variance in the presence of high agreement

TLDR

Pi and kappa statistics are widely used to measure rater agreement on nominal data, but they can produce paradoxical results in certain situations, and existing variance estimators rely on the assumption that raters are independent. This study investigates the origins of kappa's limitations and proposes the AC1 coefficient as a more stable agreement measure. It also derives new variance estimators for the multiple-rater generalized pi and AC1 statistics that remain valid without assuming rater independence. Monte-Carlo simulations confirm the validity of the new variance estimators for confidence-interval construction and support AC1 as an improved alternative to existing inter-rater reliability statistics.

Abstract

Pi (π) and kappa (κ) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. These coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations and introduces an alternative, more stable agreement coefficient referred to as the AC1 coefficient. Also proposed are new variance estimators for the multiple-rater generalized pi and AC1 statistics, whose validity does not depend upon the hypothesis of independence between raters. This is an improvement over existing alternative variances, which depend on the independence assumption. A Monte-Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of AC1 as an improved alternative to existing inter-rater reliability statistics.
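
To make the paradox concrete, the following minimal sketch computes Cohen's kappa, Scott's pi and AC1 for two raters from a square table of counts. The abstract does not state any formulas, so the chance-agreement terms below are the commonly cited two-rater definitions and should be read as an assumption here. The function name agreement_coefficients and the counts in the table are hypothetical, chosen only to show a case where the raters almost always agree and one category is rare.

```python
def agreement_coefficients(table):
    """Return kappa, pi and AC1 for a square two-rater table of counts.

    table[i][j] = number of subjects put in category i by rater A
    and category j by rater B (assumed layout).
    """
    q = len(table)                                   # number of categories
    n = sum(sum(row) for row in table)               # number of subjects
    p = [[count / n for count in row] for row in table]

    # Observed agreement: proportion of subjects on the diagonal.
    pa = sum(p[k][k] for k in range(q))

    # Marginal category proportions for each rater, and their average.
    row_m = [sum(p[k]) for k in range(q)]
    col_m = [sum(p[i][k] for i in range(q)) for k in range(q)]
    avg = [(r + c) / 2 for r, c in zip(row_m, col_m)]

    # Chance-agreement terms (commonly cited definitions, assumed here).
    pe_kappa = sum(r * c for r, c in zip(row_m, col_m))
    pe_pi = sum(a * a for a in avg)
    pe_ac1 = sum(a * (1 - a) for a in avg) / (q - 1)

    def corrected(pe):
        # Shared chance-corrected form: (pa - pe) / (1 - pe).
        return (pa - pe) / (1 - pe)

    return {
        "observed": pa,
        "kappa": corrected(pe_kappa),
        "pi": corrected(pe_pi),
        "ac1": corrected(pe_ac1),
    }


# Hypothetical counts: 125 subjects, the raters agree on 118 of them.
table = [[118, 5],
         [2, 0]]
result = agreement_coefficients(table)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With these counts the observed agreement is about 0.94, yet kappa and pi both come out slightly negative, while AC1 stays close to the observed agreement. This is the kind of behaviour the abstract calls the paradoxes of kappa. The sketch covers only the two-rater point estimates; it does not attempt the multiple-rater generalization or the variance estimators proposed in the paper.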
