Publication | Closed Access
Topic modeling
Citations: 1.1K
References: 8
Year: 2006
Unknown Venue
Engineering, Unigram Topic Model, Corpus Linguistics, Text Mining, Word Embeddings, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Textual Corpora, Language Studies, Content Analysis, Machine Translation, NLP Task, Retrieval Augmented Generation, Topic Model, Gibbs EM Algorithm, Linguistics, Language Generation
Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
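To make the combination of latent topics and bigram statistics concrete, the following is a minimal sketch of one plausible generative process in which each word depends on both a per-token topic and the preceding word. It is an illustration under stated assumptions, not the model as specified in the paper: the function names, the symmetric Dirichlet priors `alpha` and `beta`, and the dedicated start-of-document context are all hypothetical simplifications, and the hierarchical prior structure and Gibbs EM hyperparameter inference described in the abstract are not implemented here.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index according to a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, vocab_size, num_topics,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate documents where each word depends on a topic and the previous word.

    Assumption: symmetric priors alpha (topics) and beta (words) stand in
    for the hierarchical priors of the original model.
    """
    rng = random.Random(seed)
    start = vocab_size  # hypothetical extra context id meaning "no previous word"

    # phi[k][v] is a distribution over the next word, given topic k and
    # previous word v (v == start at the beginning of a document).
    phi = [[sample_dirichlet([beta] * vocab_size, rng)
            for _ in range(vocab_size + 1)]
           for _ in range(num_topics)]

    docs = []
    for _ in range(num_docs):
        # Per-document distribution over topics.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        prev = start
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)         # topic for this position
            w = sample_index(phi[z][prev], rng)  # word given topic and previous word
            words.append(w)
            prev = w
        docs.append(words)
    return docs


if __name__ == "__main__":
    for doc in generate_corpus(num_docs=3, doc_len=8, vocab_size=6, num_topics=2):
        print(doc)
```

Ignoring `prev` when sampling `w` would reduce this sketch to a unigram topic model, while fixing `num_topics=1` would reduce it to a plain bigram model, which is one way to see how the two model families are combined.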