Hierarchical Probabilistic Neural Network Language Model.

TLDR

Neural network language models learn continuous word embeddings that improve generalization but are much slower than n‑gram models for training and recognition. The authors aim to accelerate these models by introducing a hierarchical decomposition of conditional probabilities that speeds training and recognition by roughly 200‑fold. The decomposition employs a binary hierarchical clustering guided by WordNet semantic relations to structure the probability space.

Abstract

In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provide good generalization even when the number of training examples is insufficient. However, these models are extremely slow in comparison to the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed-up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy.

References

Page 1

	Year	Citations

Page 1