Publication | Closed Access
An information-theoretic measure of term specificity
Citations: 40
References: 0
Year: 1992
Engineering, Inverse Document Frequency, Semantics, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Document Classification, Language Studies, Content Analysis, Statistics, Knowledge Discovery, Terminology Extraction, Index Term, Distributional Semantics, Information Extraction, Vector Space Model, Information Structure, Keyword Extraction, Linguistics, Term Specificity
The inverse document frequency (IDF) and signal-noise ratio (S/N) approaches are two well-known term weighting schemes based on term specificity. However, the existing justifications for these methods are still somewhat inconclusive and sometimes even based on incompatible assumptions. Although both methods are related to term specificity, their relationship has not been thoroughly investigated. An information-theoretic measure for term specificity is introduced in this study. It is explicitly shown that the IDF weighting scheme can be derived from the proposed approach by assuming that the frequency of occurrence of each index term is uniform within the set of documents containing the term. The information-theoretic interpretation of term specificity also establishes the relationship between the IDF and S/N methods. © 1992 John Wiley & Sons, Inc.
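The abstract does not give the paper's formulas, so the following is only a minimal sketch of one plausible reading, not the authors' definition. It assumes that specificity is measured as log2(N) minus the entropy of a term's occurrences across documents, where N is the collection size. Under that assumption, a term spread uniformly over the n documents that contain it has entropy log2(n), and the measure reduces to log2(N/n), the usual IDF. The function names `term_entropy`, `specificity`, and `idf` are hypothetical and chosen for illustration.

```python
import math


def term_entropy(tf_by_doc):
    """Shannon entropy (bits) of one term's occurrences across documents.

    tf_by_doc maps a document id to the term's frequency in that document.
    Assumes the term occurs at least once in the collection.
    """
    total = sum(tf_by_doc.values())
    if total <= 0:
        raise ValueError("term must occur at least once")
    return -sum(
        (f / total) * math.log2(f / total)
        for f in tf_by_doc.values()
        if f > 0
    )


def specificity(tf_by_doc, n_docs):
    """Assumed specificity measure: log2(N) minus the term's entropy."""
    return math.log2(n_docs) - term_entropy(tf_by_doc)


def idf(tf_by_doc, n_docs):
    """Classic IDF: log2(N / n), with n the number of documents containing the term."""
    df = sum(1 for f in tf_by_doc.values() if f > 0)
    if df == 0:
        raise ValueError("term must occur at least once")
    return math.log2(n_docs / df)


# Uniform case: the term appears equally often in each of 4 documents out of 16.
uniform = {"d1": 2, "d2": 2, "d3": 2, "d4": 2}
# Skewed case: same 4 documents, but occurrences concentrate in one of them.
skewed = {"d1": 5, "d2": 1, "d3": 1, "d4": 1}

print(specificity(uniform, 16), idf(uniform, 16))
print(specificity(skewed, 16), idf(skewed, 16))
```

In the uniform case the two quantities coincide. In the skewed case the entropy-based value exceeds IDF, because IDF ignores how occurrences are distributed within the documents that contain the term. The entropy term here has the same form as the noise component commonly used in signal-noise weighting, which suggests how such a measure could relate the two schemes; the paper's exact formulation may differ.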