Automatically Constructing a Corpus of Sentential Paraphrases.

TLDR

The lack of large-scale, publicly available corpora of sentential paraphrases hampers research in automatic paraphrase identification and generation. The paper describes the creation of the Microsoft Research Paraphrase Corpus and discusses issues that arose in defining guidelines for the human raters. Candidate pairs were selected from topic-clustered news data using heuristic extraction techniques together with an SVM-based classifier, then submitted to human judges, who gave each pair a binary paraphrase judgment. The judges confirmed that 67% of the 5,801 sentence pairs were semantically equivalent.

Abstract

An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.
