Distributed Representations of Sentences and Documents

TLDR

Many machine learning algorithms require fixed‑length feature vectors, but common bag‑of‑words representations lose word order and semantics, treating words such as powerful, strong, and Paris as equally distant. This paper introduces Paragraph Vector, an unsupervised method that learns fixed‑length representations from variable‑length texts such as sentences, paragraphs, and documents. The method represents each document with a dense vector trained to predict its constituent words, thereby addressing the ordering and semantic limitations of bag‑of‑words models. Experiments demonstrate that Paragraph Vectors outperform bag‑of‑words and other representation techniques, achieving new state‑of‑the‑art results on several text classification and sentiment analysis tasks.

Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore the semantics of the words. For example, powerful, strong and Paris are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
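
The two bag-of-words weaknesses named in the abstract can be illustrated with a small sketch. This is not code from the paper: the helper names `bag_of_words` and `euclidean`, the three-word vocabularies, and the example sentences are all assumptions chosen for illustration, and the sketch covers only the bag-of-words baseline, not Paragraph Vector itself.

```python
from collections import Counter
import math


def bag_of_words(text, vocab):
    """Hypothetical helper: count how often each vocabulary word occurs in text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def euclidean(a, b):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Weakness 1: semantics are ignored. Each single word becomes a one-hot
# vector, so every pair of distinct words is the same distance apart.
vocab = ["powerful", "strong", "paris"]
vectors = {word: bag_of_words(word, vocab) for word in vocab}
d_powerful_strong = euclidean(vectors["powerful"], vectors["strong"])
d_powerful_paris = euclidean(vectors["powerful"], vectors["paris"])

# Weakness 2: word order is lost. Two sentences with the same words in a
# different order receive identical count vectors.
order_vocab = ["dog", "bites", "man"]
v1 = bag_of_words("dog bites man", order_vocab)
v2 = bag_of_words("man bites dog", order_vocab)

print(d_powerful_strong == d_powerful_paris)  # semantically close and unrelated pairs tie
print(v1 == v2)                               # reordered sentences are indistinguishable
```

Under these assumptions, "powerful" is exactly as far from "strong" as it is from "Paris", and the two reordered sentences collapse to the same vector, which is the gap a dense, learned document vector is meant to close.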
