Concepedia

Abstract

Choosing an appropriate statistical distribution for modeling text corpora is important for accurately estimating their numerical characteristics. Based on a test validating the claim that the data conform to a Poisson distribution, we propose the Poisson decomposition model (PDM), a statistical model for the count data of text corpora that directly captures each document's multidimensional numerical characteristics over topics. In PDM, each topic is represented as a parameter vector of a multidimensional Poisson distribution, which can easily be normalized to multinomial term probabilities. Each document is represented by its measurements on the topics and is thereby reduced to a measurement vector over topics. We use gradient descent methods and a sampling algorithm for parameter estimation. We carry out extensive experiments on the topics produced by our models. The results demonstrate that, compared to PLSI and LDA, our approach extracts more coherent topics and is competitive in document clustering when PDM-based features are used.
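The representation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the model's exact formulation, so everything here is an assumption made for illustration: the toy values, the function names, and in particular the mixing rule in which a document's rate for a term is the sum of its topic measurements times the topic's Poisson rates. It shows only how a topic's rate vector normalizes to multinomial term probabilities and how a document reduces to a measurement vector over topics; it does not reproduce the paper's estimation procedure.

```python
import math

# Hypothetical toy values: 2 topics over a 3-term vocabulary.
# Each topic is a vector of non-negative Poisson rates, one per term.
topic_rates = [
    [4.0, 1.0, 0.0],
    [0.5, 0.5, 3.0],
]


def normalize_topic(rates):
    """Turn a topic's Poisson rate vector into multinomial term probabilities."""
    total = sum(rates)
    return [r / total for r in rates]


def expected_counts(doc_weights, rates_by_topic):
    """Assumed mixing rule (not taken from the paper): the rate of term j in a
    document is sum_k doc_weights[k] * rates_by_topic[k][j]."""
    n_terms = len(rates_by_topic[0])
    return [
        sum(w * topic[j] for w, topic in zip(doc_weights, rates_by_topic))
        for j in range(n_terms)
    ]


def poisson_log_likelihood(counts, rates):
    """Log-likelihood of observed term counts under independent Poisson rates."""
    ll = 0.0
    for x, lam in zip(counts, rates):
        if lam == 0.0:
            # A zero rate can only explain a zero count.
            if x > 0:
                return float("-inf")
            continue
        ll += x * math.log(lam) - lam - math.lgamma(x + 1)
    return ll


# A document is reduced to one measurement per topic (hypothetical values).
doc_weights = [2.0, 1.0]
counts = [8, 3, 3]  # observed term counts for that document

term_probs = [normalize_topic(t) for t in topic_rates]
rates = expected_counts(doc_weights, topic_rates)
log_lik = poisson_log_likelihood(counts, rates)
```

Under these assumptions, fitting the model would mean adjusting `topic_rates` and `doc_weights` to increase `log_lik` summed over all documents, for example by gradient steps as the abstract mentions.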
