Publication | Open Access
Feature selection and feature extraction for text categorization
Citations: 580
References: 13
Year: 1992
Venue: Unknown
Engineering, Feature Extraction, Corpus Linguistics, Text Mining, Natural Language Processing, Category Membership, Classification Method, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Document Classification, Language Studies, Content Analysis, Automatic Classification, Knowledge Discovery, Intelligent Classification, Feature Clustering, Good Categorization Performance, Vector Space Model, Linguistics
The study examined how varying feature selection and syntactic feature extraction influence text categorization on the Reuters and MUC-3 data sets. The authors used a statistical classifier with a proportional assignment strategy, varied the number of features, and applied syntactic analysis and clustering to generate new features. Categorization performance was good, the optimal word-based feature set needed only 10 to 15 features, and syntactic phrases, phrase clusters, and word clusters were all less effective than individual words.
The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features) despite the large training sets. The extraction of new text features by syntactic analysis and feature clustering was investigated on the Reuters data set. Syntactic indexing phrases, clusters of these phrases, and clusters of words were all found to provide less effective representations than individual words.
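The abstract names a proportional assignment strategy without defining it. A minimal sketch of one plausible reading follows: each category is assigned to its highest-scoring test documents, in proportion to how often the category occurred in training. The function name, the `train_fraction` input, and the proportionality constant `k` are assumptions made for illustration and are not taken from the source.

```python
def proportional_assignment(scores, train_fraction, k=1.0):
    """Return sorted indices of test documents assigned to one category.

    scores: classifier scores for the test documents (higher = more likely).
    train_fraction: fraction of training documents labeled with the category.
    k: hypothetical proportionality constant (an assumption, not from the source).
    """
    # Number of test documents to assign, capped at the test set size.
    n_assign = min(len(scores), int(round(k * train_fraction * len(scores))))
    # Rank documents from highest to lowest score and keep the top n_assign.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_assign])


# Made-up scores for five test documents, purely for illustration.
example_scores = [0.1, 0.9, 0.4, 0.8, 0.2]
assigned = proportional_assignment(example_scores, train_fraction=0.4)
```

Under this reading, raising `k` assigns the category to more documents, which would typically trade precision for recall.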