Concepedia

TLDR

Entailment and contradiction inference is central to natural language understanding, yet progress has been limited by the lack of large-scale resources. The authors introduce the Stanford Natural Language Inference (SNLI) corpus, a freely available collection of 570K human-written, labeled sentence pairs produced through a novel grounded task based on image captioning. At two orders of magnitude larger than prior datasets of its type, the corpus lets simple lexicalized classifiers outperform some sophisticated existing entailment models and enables a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Abstract

Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
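To make the task concrete, the corpus pairs a premise sentence with a hypothesis sentence and labels the relation between them. A minimal sketch of that three-way labeling scheme (entailment, contradiction, neutral) is shown below; the sentence pairs here are invented for illustration, not drawn from the corpus itself.

```python
# Hypothetical premise/hypothesis pairs illustrating the three NLI labels.
# These examples are illustrative only, not actual corpus entries.
pairs = [
    ("A dog is running in the park.", "An animal is outdoors.", "entailment"),
    ("A dog is running in the park.", "The dog is asleep indoors.", "contradiction"),
    ("A dog is running in the park.", "The dog is chasing a ball.", "neutral"),
]

def label_counts(examples):
    """Count examples per label, as one might when inspecting a data split."""
    counts = {}
    for _premise, _hypothesis, label in examples:
        counts[label] = counts.get(label, 0) + 1
    return counts

print(label_counts(pairs))  # {'entailment': 1, 'contradiction': 1, 'neutral': 1}
```

Each corpus example follows this shape: the label records whether the hypothesis must be true given the premise (entailment), cannot be true (contradiction), or might be true (neutral).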
