Publication | Closed Access
Information retrieval system evaluation
Citations: 365
References: 12
Year: 2005
Venue: unknown
Keywords: Engineering, Semantic Web, Corpus Linguistics, Social Sciences, Text Mining, Information Retrieval, Data Science, Language Testing, Wilcoxon Test, Relevance Feedback, Statistics, Retrieval Technique, Reliability, IR Systems, Knowledge Retrieval, Information Management, Significance Tests, Software Testing, Test Collection, Interactive Information Retrieval
Information retrieval system effectiveness is typically assessed by comparing performance on shared queries and documents, with significance tests used to gauge the reliability of the comparison, yet prior studies either had limited applicability or proposed tests that were too stringent. This paper revisits how significance tests should be applied in evaluating IR systems. The authors find that the t-test is the most reliable of the tests examined, that past empirical work overestimated the error of such tests, and that, once assessor effort is taken into account, effort is better spent building test collections with more topics, each judged in less detail, rather than judging fewer topics deeply for measures such as mean average precision.
The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests over-estimated the error of such tests. We also re-consider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not consider the assessor effort required to compute such measures. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.
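The abstract compares three paired significance tests (t-test, Wilcoxon signed-rank, sign test) applied to per-topic effectiveness scores, and two effectiveness measures (precision at rank 10 and mean average precision). The Python sketch below illustrates that setup; it is not the authors' code, and the per-topic scores it tests are randomly generated placeholders standing in for real test-collection results.

```python
# Minimal sketch (assumed setup, not the paper's code): compare two IR
# systems with the paired significance tests the paper discusses.
import numpy as np
from scipy import stats

def average_precision(rels):
    """AP for one topic; rels is a 0/1 relevance list in rank order."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

def precision_at_10(rels):
    """P@10 for one topic: fraction of relevant documents in the top 10."""
    return sum(rels[:10]) / 10.0

# The two measures on a toy ranking (1 = relevant, 0 = not).
ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
print(f"AP = {average_precision(ranking):.3f}, "
      f"P@10 = {precision_at_10(ranking):.2f}")

# Hypothetical per-topic scores for systems A and B over 50 topics;
# in a real evaluation these would come from a test collection.
rng = np.random.default_rng(0)
sys_a = rng.uniform(0.1, 0.6, size=50)
sys_b = np.clip(sys_a + rng.normal(0.03, 0.05, size=50), 0.0, 1.0)

# Paired t-test: the test the paper finds most reliable.
_, t_p = stats.ttest_rel(sys_b, sys_a)

# Wilcoxon signed-rank test on the per-topic differences.
_, w_p = stats.wilcoxon(sys_b, sys_a)

# Sign test: binomial test on the number of topics where B beats A.
diffs = sys_b - sys_a
wins = int(np.sum(diffs > 0))
n = int(np.sum(diffs != 0))  # ties are dropped, as is conventional
sign_p = stats.binomtest(wins, n=n, p=0.5).pvalue

print(f"t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}, "
      f"sign test p = {sign_p:.4f}")
```

All three tests pair scores topic by topic, so they differ only in how much of the score distribution they use: the t-test uses the magnitudes of the differences, Wilcoxon their ranks, and the sign test only their direction.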