Publication | Closed Access
High-performing feature selection for text classification
Citations: 43
References: 0
Year: 2002
Venue: Unknown
Engineering, High-performing Feature Selection, Feature Selection, Text Mining, Natural Language Processing, Classification Method, Information Retrieval, Data Science, Data Mining, Pattern Recognition, Naive Bayesian, Document Classification, Text Classification, Automatic Classification, Predictive Analytics, Knowledge Discovery, Intelligent Classification, Computer Science, Feature Selection Methods
This paper reports a controlled study of a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method, and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on χ² (chi-square) statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
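
To make the χ²-based filtering described in the abstract concrete, the following is a minimal sketch of the textbook two-way χ² statistic for a term and a category, followed by keeping a top fraction of the ranked terms. It is not the authors' implementation: the abstract does not say which of the 100+ variants performed best, and the function names (`chi2_score`, `select_features`), the pre-tokenized input format, and the one-category-at-a-time scoring are assumptions made purely for illustration. The `top_fraction=0.03` default simply echoes the 3% figure quoted in the abstract.

```python
from collections import Counter


def chi2_score(a, b, c, d):
    """Textbook chi-square statistic for a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. term occurs in every document): no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, category, top_fraction=0.03):
    """Rank terms by chi-square against one category and keep the top fraction.

    docs is a list of token lists and labels the matching category names;
    both are hypothetical input formats chosen for this sketch.
    """
    n_docs = len(docs)
    n_pos = sum(1 for y in labels if y == category)

    df_all = Counter()  # document frequency over the whole collection
    df_pos = Counter()  # document frequency inside the target category
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df_all[term] += 1
            if y == category:
                df_pos[term] += 1

    scores = {}
    for term, df in df_all.items():
        a = df_pos[term]
        b = df - a
        c = n_pos - a
        d = n_docs - n_pos - b
        scores[term] = chi2_score(a, b, c, d)

    # Always keep at least one term; break ties alphabetically for determinism.
    k = max(1, round(top_fraction * len(scores)))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


if __name__ == "__main__":
    toy_docs = [["wheat", "price"], ["wheat", "export"],
                ["oil", "price"], ["oil", "export"]]
    toy_labels = ["grain", "grain", "crude", "crude"]
    kept, all_scores = select_features(toy_docs, toy_labels, "grain",
                                       top_fraction=0.5)
    print(kept, all_scores)
```

Note that the statistic is symmetric in presence and absence, so a term that perfectly signals the category's absence ("oil" in the toy data) scores as high as one that signals its presence ("wheat"). Whether that is desirable depends on the classifier, and it is one of the places where variants of the criterion could plausibly differ.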