Concepedia

TLDR

Automatic information retrieval systems must handle documents of widely varying lengths, so document length normalization is used to retrieve documents of all lengths fairly. The study develops pivoted normalization to reduce the gap between the probability of relevance and the probability of retrieval, and proposes new normalization functions that address shortcomings of the cosine function. Pivoted normalization modifies any normalization function by pivoting it around a reference point; the study applies the idea to the well-known cosine normalization and introduces pivoted unique and pivoted byte-size normalization as new variants. Experiments show that pivoted normalization reduces systematic deviations between retrieval and relevance probabilities and generalizes across collections, yielding a robust, collection-independent normalization technique.

Abstract

Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with chances similar to their likelihood of relevance will outperform another scheme which retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
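The paper's formulas are not reproduced on this page, but the pivoting idea can be sketched as a linear rescaling of an existing normalization factor around a reference point, so that factors above the pivot are reduced (long documents are penalized less) and factors below it are raised. The sketch below is illustrative only: the pivot and slope values are assumed placeholders, not the paper's trained constants.

```python
import math

def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Pivot an existing normalization factor around a reference point.

    For 0 < slope < 1, factors above the pivot shrink toward it and
    factors below it grow toward it.
    """
    return (1.0 - slope) * pivot + slope * old_norm

def cosine_norm(term_weights: list[float]) -> float:
    """Classic cosine normalization factor: the Euclidean length of the
    document's term-weight vector."""
    return math.sqrt(sum(w * w for w in term_weights))

# Illustrative use: divide a document's term weights by the pivoted factor.
# The pivot is commonly taken to be the average normalization factor over
# the collection, and the slope is tuned on training data (values assumed).
doc_weights = [1.2, 0.7, 2.3, 0.4]
avg_norm_over_collection = 5.0   # assumed collection statistic
slope = 0.7                      # assumed tuning value
factor = pivoted_norm(cosine_norm(doc_weights), avg_norm_over_collection, slope)
normalized = [w / factor for w in doc_weights]
```

Pivoted unique normalization and pivoted byte-size normalization follow the same pattern, with the number of unique terms or the document's size in bytes standing in for the cosine factor as the old normalization value.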
