Three new graphical models for statistical language modelling

TLDR

Parametric models that use distributed representations to mitigate data sparsity have recently challenged the dominance of n-gram models in statistical language modelling. The authors introduce three probabilistic language models that predict the next word from the distributed representations of several preceding words. The models learn real-valued word representations jointly with a large set of stochastic binary hidden features, which are used to predict the next word's representation from the previous ones. Adding connections from the previous states of the hidden features improves performance, as does adding direct connections between the word representations, and one of the models significantly outperforms the best n-gram models.

Abstract

The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence given several preceding words by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models.
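
As a rough illustration of predicting the next word's distributed representation directly from those of the preceding words, the sketch below combines the context words' vectors through one matrix per context position and scores every vocabulary word by its similarity to the predicted vector. It is a minimal sketch under stated assumptions, not the authors' implementation: it leaves out the stochastic binary hidden features and all training, the parameters are random and untrained, and every name in it (`next_word_distribution`, `position_weights`, `embeddings`, `biases`) is hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Return a rows x cols matrix of small random values (untrained parameters)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def next_word_distribution(context_ids, embeddings, position_weights, biases):
    """Distribution over the next word given the ids of the preceding words.

    Assumed form: the predicted representation is the sum of each context
    word's vector transformed by a position-specific matrix; each candidate
    word is scored by a dot product with that prediction plus a bias, and
    the scores are normalised with a softmax.
    """
    dim = len(embeddings[0])
    predicted = [0.0] * dim
    for word_id, weight in zip(context_ids, position_weights):
        contribution = matvec(weight, embeddings[word_id])
        predicted = [p + c for p, c in zip(predicted, contribution)]

    scores = [
        sum(p * e for p, e in zip(predicted, emb)) + bias
        for emb, bias in zip(embeddings, biases)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy setup: 6-word vocabulary, 4-dimensional vectors, 2 words of context.
rng = random.Random(0)
vocab_size, dim, context_len = 6, 4, 2
embeddings = make_matrix(vocab_size, dim, rng)
position_weights = [make_matrix(dim, dim, rng) for _ in range(context_len)]
biases = [0.0] * vocab_size

probs = next_word_distribution([1, 3], embeddings, position_weights, biases)
```

Here `probs` holds one probability per vocabulary word. In a trained model the word vectors, position matrices, and biases would all be learned from data, which is what allows words that occur in similar contexts to share statistical strength.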
