Publication | Closed Access
Pivoted Document Length Normalization
Citations: 857
References: 10
Year: 2017
Engineering, Intelligent Information Retrieval, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Relevance Feedback, Query Expansion, Machine Translation, Unique Normalization, Document Length Normalization, Knowledge Discovery, Data Normalization, Computer Science, Information Extraction, Retrieval Probabilities, Text Normalization, Relevance Probabilities, Document Processing
Automatic information retrieval systems must handle documents of varying lengths, so document length normalization is used to fairly retrieve documents of all lengths. The study aims to develop pivoted normalization to reduce the gap between relevance and retrieval probabilities and to propose new normalization functions that address shortcomings of the cosine function. Pivoted normalization modifies any normalization function by pivoting around a reference point, using the cosine normalization as a base and introducing pivoted unique and pivoted byte‑size variants. The study demonstrates that pivoted normalization reduces systematic deviations between retrieval and relevance probabilities and generalizes across collections, yielding a robust, collection‑independent normalization technique.
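The comparison between retrieval and relevance probabilities can be illustrated with a small sketch. It assumes a simplified setup that is not spelled out in the text above: documents are sorted by length, cut into equal-sized bins, and each bin is scored by the share of all flagged documents (relevant or retrieved) that fall in it. The function name `probability_by_length_bin` and its parameters are hypothetical.

```python
def probability_by_length_bin(doc_lengths, flagged_ids, bin_size):
    """Estimate P(document is in bin | document is flagged) per length bin.

    doc_lengths: dict mapping document id -> length (any consistent unit).
    flagged_ids: set of document ids that are relevant (or retrieved).
    bin_size:    number of documents per bin (an illustrative assumption).

    Returns a list of (median_length_of_bin, probability) tuples,
    ordered from the shortest documents to the longest.
    """
    # Order document ids from shortest to longest.
    ordered = sorted(doc_lengths, key=doc_lengths.get)
    total_flagged = sum(1 for doc_id in ordered if doc_id in flagged_ids)

    curve = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        hits = sum(1 for doc_id in chunk if doc_id in flagged_ids)
        # Use the middle document of the bin as its representative length.
        median_length = doc_lengths[chunk[len(chunk) // 2]]
        probability = hits / total_flagged if total_flagged else 0.0
        curve.append((median_length, probability))
    return curve


if __name__ == "__main__":
    lengths = {"a": 10, "b": 20, "c": 30, "d": 40}
    relevance_curve = probability_by_length_bin(lengths, {"c", "d"}, 2)
    retrieval_curve = probability_by_length_bin(lengths, {"a", "d"}, 2)
    # A gap between the two curves at a given length indicates that the
    # normalization over- or under-retrieves documents of that length.
    for (length, p_rel), (_, p_ret) in zip(relevance_curve, retrieval_curve):
        print(length, p_rel, p_ret)
```

Running the same function once with the relevant set and once with the retrieved set gives two curves over document length; where they diverge, the normalization scheme favors or penalizes documents of that length.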
Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to fairly retrieve documents of all lengths. In this study, we observe that a normalization scheme that retrieves documents of all lengths with similar chances as their likelihood of relevance will outperform another scheme which retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Training pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
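As a rough illustration of the pivoting idea, the sketch below uses the commonly stated form of pivoted normalization, `(1 - slope) * pivot + slope * old_norm`, which the abstract itself does not give. The function name `pivoted_norm`, the example numbers, and the slope value of 0.2 are assumptions chosen for illustration only; in practice the pivot and slope would be set or trained on a collection.

```python
def pivoted_norm(old_norm, pivot, slope):
    """Tilt an existing normalization factor around a pivot point.

    old_norm: the original normalization factor for a document, for
              example its cosine norm, its number of unique terms, or
              its size in bytes.
    pivot:    the value of old_norm at which the factor is left
              unchanged (often taken to be the collection average).
    slope:    a value in (0, 1]; slope == 1 reproduces old_norm.
    """
    return (1.0 - slope) * pivot + slope * old_norm


if __name__ == "__main__":
    pivot = 10.0   # hypothetical average normalization factor
    slope = 0.2    # illustrative value, not a recommendation

    for old in (5.0, 10.0, 20.0):
        new = pivoted_norm(old, pivot, slope)
        # Term weights are divided by the normalization factor, so a
        # smaller factor raises a document's score and a larger one
        # lowers it.
        print(f"old={old:5.1f}  pivoted={new:5.1f}")
```

With a slope below 1, documents whose original factor exceeds the pivot (typically longer documents) receive a smaller factor than before, and those below the pivot receive a larger one, which is the direction of correction the abstract describes for the cosine function.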
| Year | Citations | |
|---|---|---|
| 1988 | 9.3K | |
| 1975 | 7.4K | |
| 1983 | 6K | |
| 1989 | 3.2K | |
| 1994 | 2.2K | |
| 1994 | 467 | |
| 2017 | 425 | |
| 1994 | 66 | |
| 1995 | 66 | |
| 1993 | 55 | |