Publication | Closed Access
Document preprocessing for naive Bayes classification and clustering with mixture of multinomials
Citations: 25
References: 8
Year: 2004
Unknown Venue
Engineering, Corpus Linguistics, Text Mining, Natural Language Processing, Classification Method, Information Retrieval, Data Science, Data Mining, Pattern Recognition, Computational Linguistics, Document Classification, Statistics, Document Clustering, Naive Bayes Classifier, Automatic Classification, Naive Bayes, Knowledge Discovery, Intelligent Classification, Computer Science, Heuristic Feature Transformations, Naive Bayes Classification, Classification
The naive Bayes classifier has long been used for text categorization tasks. Its sibling from the unsupervised world, the probabilistic mixture of multinomial models, has likewise been applied successfully to text clustering problems. Despite the strong independence assumptions that these models make, their attractiveness comes from low computational cost, relatively low memory consumption, the ability to handle heterogeneous features and multiple classes, and performance that is often competitive with top-of-the-line models. Recently, there have been several attempts to alleviate the problems of naive Bayes by performing heuristic feature transformations, such as IDF weighting, normalization by document length, and taking the logarithms of the counts. We justify the use of these techniques and apply them to two problems: classification of products in Yahoo! Shopping and clustering of the vectors of collocated terms in user queries to Yahoo! Search. The experimental evaluation allows us to draw conclusions about the promise that these transformations hold for alleviating the strong assumptions of the multinomial model.
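The three transformations named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the log damping `log(1 + count)`, the IDF form `log(N / df)`, L2 length normalization, the order in which they are applied, and the helper names `fit_idf`, `transform`, `train_nb`, and `predict`. The toy documents are invented, and the classifier is an ordinary multinomial naive Bayes with add-alpha smoothing that accepts fractional counts, not necessarily the authors' setup.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute IDF weights log(N / df) from a list of token lists (assumed form)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def transform(tokens, idf):
    """Apply log damping, IDF weighting, then L2 length normalization."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    weights = {t: math.log(1 + c) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm > 0:
        weights = {t: w / norm for t, w in weights.items()}
    return weights


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial naive Bayes over fractional counts with add-alpha smoothing."""
    vocab = sorted({t for v in vectors for t in v})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        members = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(members) / len(vectors))
        totals = Counter()
        for v in members:
            for t, w in v.items():
                totals[t] += w
        denom = sum(totals.values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((totals[t] + alpha) / denom) for t in vocab}
    return log_prior, log_lik


def predict(model, vector):
    """Return the class with the highest posterior log score."""
    log_prior, log_lik = model

    def score(c):
        return log_prior[c] + sum(
            w * log_lik[c][t] for t, w in vector.items() if t in log_lik[c]
        )

    return max(log_prior, key=score)


# Toy data, invented for illustration only.
train_docs = [
    ["camera", "lens", "camera"],
    ["lens", "zoom"],
    ["shirt", "cotton"],
    ["shirt", "sleeve", "cotton"],
]
train_labels = ["electronics", "electronics", "apparel", "apparel"]

idf = fit_idf(train_docs)
train_vectors = [transform(d, idf) for d in train_docs]
model = train_nb(train_vectors, train_labels)
print(predict(model, transform(["camera", "zoom"], idf)))
```

In this sketch the transformed weights simply replace raw term counts in the multinomial likelihood, which is one common way to combine such heuristics with naive Bayes. The same transformed vectors could in principle be fed to a mixture-of-multinomials clustering routine instead.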