Publication | Closed Access
TopCat: data mining for topic identification in a text corpus
Citations: 121
References: 45
Year: 2004
Topics: Engineering, Related Items, Topic Categories, Semantic Web, Frequent Itemsets, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Document Classification, Language Studies, Content Analysis, Document Clustering, Knowledge Discovery, Terminology Extraction, Information Extraction, Topic Model, Keyword Extraction, Linguistics
TopCat (topic categories) is a technique for identifying topics that recur across articles in a text corpus. Natural language processing techniques identify the key entities in each article, so that an article can be represented as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. An evaluation against a manually categorized ground-truth news corpus shows that this technique is effective in identifying topics in collections of news articles.
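To make the "article as a set of items" view concrete, here is a minimal Python sketch of the frequent-itemset step only. It is not the TopCat implementation: the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy corpus are all assumptions made for illustration. It counts itemsets by brute-force enumeration rather than a tuned mining algorithm, and it leaves out the hypergraph partitioning stage entirely.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support=2, max_size=3):
    """Count itemsets of up to max_size items and keep those that
    occur in at least min_support articles.

    Hypothetical helper for illustration; brute-force enumeration,
    suitable only for small inputs.
    """
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))  # each article is treated as a set of entities
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Made-up toy corpus: each article is reduced to its named entities.
articles = [
    {"NATO", "Kosovo", "Serbia"},
    {"NATO", "Kosovo", "Belgrade"},
    {"Dow Jones", "Wall Street"},
    {"Dow Jones", "Wall Street", "Nasdaq"},
]

result = frequent_itemsets(articles)

# Itemsets with two or more entities hint at co-occurring topic vocabulary.
multi_item = {s: n for s, n in result.items() if len(s) >= 2}
for itemset, support in multi_item.items():
    print(sorted(itemset), support)
```

On this toy input, the multi-entity itemsets that meet the support threshold are {Kosovo, NATO} and {Dow Jones, Wall Street}. In the approach the abstract describes, such itemsets would then be grouped into topic clusters by hypergraph partitioning, which this sketch does not attempt.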