
Publication | Open Access

Agreement, the F-Measure, and Reliability in Information Retrieval

944 Citations | 6 References | 2005

TLDR

Information retrieval studies that involve searching the Internet or marking phrases usually lack a well-defined number of negative cases, which prevents the use of traditional interrater reliability metrics such as κ and leads researchers to quantify system performance as precision, recall, F-measure, or agreement. The study evaluates positive specific agreement (equivalently, the F-measure) as a metric for interrater reliability in information retrieval tasks. Analyzing pairwise agreement among experts, the authors show that the average F-measure between expert pairs is numerically identical to the average positive specific agreement, and that κ approaches these measures as the number of negative cases grows large.

Abstract

Information retrieval studies that involve searching the Internet or marking phrases usually lack a well-defined number of negative cases. This prevents the use of traditional interrater reliability metrics like the κ statistic to assess the quality of expert-generated gold standards. Such studies often quantify system performance as precision, recall, and F-measure, or as agreement. It can be shown that the average F-measure among pairs of experts is numerically identical to the average positive specific agreement among experts and that κ approaches these measures as the number of negative cases grows large. Positive specific agreement—or the equivalent F-measure—may be an appropriate way to quantify interrater reliability and therefore to assess the reliability of a gold standard in these studies.
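The identity stated above can be sketched in standard 2×2 agreement notation; the symbols a, b, c, d below are assumed here for illustration and are not necessarily the paper's own notation. Let a be the cases both raters mark positive, b and c the two disagreement cells, and d the cases both raters mark negative. Treating one rater as the reference, the other rater's precision and recall are

\[
P = \frac{a}{a+b}, \qquad R = \frac{a}{a+c},
\]

so the F-measure coincides with positive specific agreement (PSA):

\[
F = \frac{2PR}{P+R} = \frac{2a}{2a+b+c} = \mathrm{PSA}.
\]

In the same notation, Cohen's κ reduces to

\[
\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{2(ad - bc)}{b^{2}+c^{2}+ab+ac+2ad+bd+cd},
\]

and holding a, b, c fixed while the number of negative cases d grows gives

\[
\lim_{d\to\infty}\kappa = \frac{2a}{2a+b+c} = \mathrm{PSA} = F.
\]

In the limit, κ and PSA coincide, which is why PSA (equivalently the F-measure) can stand in for κ as a reliability measure when the number of negative cases is undefined.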
