Publication | Open Access
Agreement, the F-Measure, and Reliability in Information Retrieval
Year: 2005 · Citations: 944 · References: 6
Information retrieval studies that involve searching the Internet or marking phrases usually lack a well-defined number of negative cases. This prevents the use of traditional interrater reliability metrics like the κ statistic to assess the quality of expert-generated gold standards. Such studies often quantify system performance as precision, recall, and F-measure, or as agreement. It can be shown that the average F-measure among pairs of experts is numerically identical to the average positive specific agreement among experts and that κ approaches these measures as the number of negative cases grows large. Positive specific agreement—or the equivalent F-measure—may be an appropriate way to quantify interrater reliability and therefore to assess the reliability of a gold standard in these studies.
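The equivalences stated in the abstract can be made concrete with a short sketch. Assuming a pairwise 2x2 agreement table between two raters (a = both mark positive, b and c = exactly one marks positive, d = both mark negative; the variable names are illustrative, not from the paper), positive specific agreement is 2a/(2a+b+c), the F-measure obtained by treating one rater as the gold standard reduces to the same expression, and Cohen's κ, which needs the count of negatives d, approaches that value as d grows large:

```python
# Pairwise 2x2 agreement table between two raters (illustrative names):
#   a = both positive, b = rater 1 only, c = rater 2 only, d = both negative.

def positive_specific_agreement(a, b, c):
    """p_pos = 2a / (2a + b + c); note d never appears."""
    return 2 * a / (2 * a + b + c)

def f_measure(a, b, c):
    """Treat rater 2 as gold: precision = a/(a+b), recall = a/(a+c),
    F = 2PR/(P+R), which simplifies to 2a/(2a+b+c)."""
    precision = a / (a + b)
    recall = a / (a + c)
    return 2 * precision * recall / (precision + recall)

def kappa(a, b, c, d):
    """Cohen's kappa; requires a well-defined number of negatives d."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# With a = 40, b = 10, c = 10: F = p_pos = 0.8 regardless of d,
# while kappa depends on d but tends toward 0.8 as d grows large.
```

This illustrates the abstract's point: when the negative cases are ill-defined (searching the Internet, marking phrases), κ cannot be computed, but p_pos and F can, and they agree with what κ would yield in the limit of many negatives.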