The generalized dirichlet distribution in enhanced topic detection

Abstract

We present a new, robust and computationally efficient Hierarchical Bayesian model for effective topic correlation modeling. We model the prior distribution of topics by a Generalized Dirichlet distribution (GD) rather than a Dirichlet distribution as in Latent Dirichlet Allocation (LDA). We define this model as GD-LDA. This framework captures correlations between topics, as in the Correlated Topic Model (CTM) and Pachinko Allocation Model (PAM), and is faster to infer than CTM and PAM. GD-LDA is effective to avoid over-fitting as the number of topics is increased. As a tree model, it accommodates the most important set of topics in the upper part of the tree based on their probability mass. Thus, GD-LDA provides the ability to choose significant topics effectively. To discover topic relationships, we perform hyper-parameter estimation based on Monte Carlo EM Estimation. We provide results using Empirical Likelihood(EL) in 4 public datasets from TREC and NIPS. Then, we present the performance of GD-LDA in ad hoc information retrieval (IR) based on MAP, [email protected], and Discounted Gain. We discuss an empirical comparison of the fitting time. We demonstrate significant improvement over CTM, LDA, and PAM for EL estimation. For all the IR measures, GD-LDA shows higher performance than LDA, the dominant topic model in IR. All these improvements with a small increase in fitting time than LDA, as opposed to CTM and PAM.

References

Page 1

	Year	Citations

Page 1