Discovering Diverse and Salient Threads in Document Collections

Abstract

We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads: singly linked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.
