Probabilistic Latent Semantic Indexing

Citations: 4K

References: 14

Year: 2017

TLDR

Probabilistic Latent Semantic Indexing is a novel, statistically grounded approach to automated document indexing that uses a latent class model for factor analysis of count data, contrasting with standard LSI by providing a proper generative data model. The model is fitted from a training corpus using a generalization of the Expectation Maximization algorithm, enabling it to handle domain‑specific synonymy and polysemous words. Retrieval experiments on several test collections show substantial performance gains over direct term matching and over LSI, especially when combining models of different dimensionalities.

Abstract

Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
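
To make the fitting step in the abstract more concrete, here is a minimal sketch of a latent class model for count data fitted with plain EM. It is not the generalized EM variant the abstract mentions, and it is not the authors' implementation. It assumes the parameterization P(w|d) = sum over z of P(w|z) P(z|d), where z is a latent class. The names `plsa_em` and `log_likelihood`, the random initialization, and the fixed iteration count are all illustrative choices.

```python
import math
import random


def _normalize(row):
    """Scale a non-negative row to sum to 1 (uniform if the row is all zeros)."""
    total = sum(row)
    if total <= 0.0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Fit P(w|z) and P(z|d) to a document-term count matrix with plain EM.

    counts[d][w] is the number of times term w occurs in document d.
    Returns (p_w_z, p_z_d), indexed as p_w_z[z][w] and p_z_d[d][z].
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Assumed initialization: random positive rows, normalized to distributions.
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iters):
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: posterior P(z | d, w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(joint)
                if norm <= 0.0:
                    continue
                for z in range(n_topics):
                    resp = n_dw * joint[z] / norm
                    acc_w_z[z][w] += resp
                    acc_z_d[d][z] += resp
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [_normalize(row) for row in acc_w_z]
        p_z_d = [_normalize(row) for row in acc_z_d]
    return p_w_z, p_z_d


def log_likelihood(counts, p_w_z, p_z_d):
    """Sum over observed (d, w) pairs of n(d, w) * log P(w|d)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw == 0:
                continue
            p_w_d = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
            if p_w_d <= 0.0:
                return float("-inf")
            total += n_dw * math.log(p_w_d)
    return total
```

Calling `plsa_em(counts, n_topics=2)` on a small count matrix returns two tables whose rows each sum to 1. With the same seed, `log_likelihood` should not decrease as `n_iters` grows, which gives a quick sanity check on the updates.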
