Concepedia

TLDR

Automatic information retrieval systems must handle documents of widely varying lengths, so document length normalization is used to retrieve documents of all lengths fairly. The study develops pivoted normalization to reduce the gap between the probability of relevance and the probability of retrieval, and proposes new normalization functions that address shortcomings of the cosine function. Pivoted normalization modifies any normalization function by pivoting it around a reference point; the study applies the idea to the well-known cosine normalization and introduces pivoted unique and pivoted byte-size normalization as new variants. Experiments show that pivoted normalization reduces systematic deviations between retrieval and relevance probabilities and generalizes across collections, yielding a robust, collection-independent normalization technique.

Abstract

Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with chances similar to their likelihood of relevance will outperform another scheme which retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
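The paper's formulas are not reproduced on this page, but the pivoting idea can be sketched as a linear rescaling of an existing normalization factor around a reference point, so that factors above the pivot are reduced (long documents are penalized less) and factors below it are raised. The sketch below is illustrative only: the pivot and slope values are assumed placeholders, not the paper's trained constants.

```python
import math

def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Pivot an existing normalization factor around a reference point.

    For 0 < slope < 1, factors above the pivot shrink toward it and
    factors below it grow toward it.
    """
    return (1.0 - slope) * pivot + slope * old_norm

def cosine_norm(term_weights: list[float]) -> float:
    """Classic cosine normalization factor: the Euclidean length of the
    document's term-weight vector."""
    return math.sqrt(sum(w * w for w in term_weights))

# Illustrative use: divide a document's term weights by the pivoted factor.
# The pivot is commonly taken to be the average normalization factor over
# the collection, and the slope is tuned on training data (values assumed).
doc_weights = [1.2, 0.7, 2.3, 0.4]
avg_norm_over_collection = 5.0   # assumed collection statistic
slope = 0.7                      # assumed tuning value
factor = pivoted_norm(cosine_norm(doc_weights), avg_norm_over_collection, slope)
normalized = [w / factor for w in doc_weights]
```

Pivoted unique normalization and pivoted byte-size normalization follow the same pattern, with the number of unique terms or the document's size in bytes standing in for the cosine factor as the old normalization value.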
