Publication | Closed Access
Interpreting TF-IDF term weights as making relevance decisions
Citations: 800
References: 72
Year: 2008
Tf-idf Term Weights, Ranking Algorithm, Engineering, Query Model, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Relevance Feedback, News Recommendation, Language Studies, Ranking Formula, Knowledge Discovery, Terminology Extraction, Vector Space Model, Novel Retrieval Model, Linguistics
The paper introduces a novel probabilistic retrieval model that interprets TF-IDF term weights as making relevance decisions, aiming to unify retrieval theory and to guide the design of more advanced term weights. The model simulates a local relevance decision at each location of a document, combines these local decisions into a document-wide decision, and simplifies to a basic ranking formula that corresponds directly to the TF-IDF term weights. The paper shows that the term-frequency factor of this formula can be rendered into the term-frequency factors of existing retrieval systems, that the remaining quantity $-\log p(\bar{r} \mid t \in d)$ can be approximated mathematically by IDF, and that this relationship is observed empirically on four TREC ad hoc retrieval collections.
A novel probabilistic retrieval model is presented. It forms a basis to interpret the TF-IDF term weights as making relevance decisions. It simulates the local relevance decision-making for every location of a document, and combines all of these “local” relevance decisions as the “document-wide” relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective about information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity is $-\log p(\bar{r} \mid t \in d)$, where $p(\bar{r} \mid t \in d)$ is interpreted as the probability of randomly picking a nonrelevant usage (denoted by $\bar{r}$) of term $t$. Mathematically, we show that this quantity can be approximated by the inverse document-frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.
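For readers unfamiliar with the weights being interpreted, the following is a minimal sketch of a textbook TF-IDF ranking. It is not the paper's probabilistic model. It assumes raw term counts as the term-frequency factor and log(N / df) as the IDF factor, and the function names and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def idf(term, docs):
    """Classic inverse document frequency, log(N / df).

    Assumption: a term that occurs in no document gets weight 0.0.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def tfidf_score(query, doc, docs):
    """Sum of (raw term frequency * IDF) over the query terms."""
    tf = Counter(doc)
    return sum(tf[term] * idf(term, docs) for term in query)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "retrieval models rank documents".split(),
]
query = ["cat", "retrieval"]

# Document indices ordered from highest to lowest score.
ranking = sorted(
    range(len(docs)),
    key=lambda i: tfidf_score(query, docs[i], docs),
    reverse=True,
)
print(ranking)
```

In this toy corpus, "retrieval" occurs in one document and "cat" in two, so the rarer term carries the larger IDF weight and the third document ranks first. Existing retrieval systems typically replace the raw count with a dampened or length-normalized term-frequency factor.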