Grammar as a Foreign Language

TLDR

Syntactic constituency parsing is a fundamental NLP problem that has seen decades of intensive research, yet the most accurate parsers remain domain-specific, complex, and inefficient. The paper shows that a domain-agnostic, attention-enhanced sequence-to-sequence model attains state-of-the-art results on the most widely used constituency parsing dataset when trained on a large synthetic corpus annotated by existing parsers. Trained only on a small human-annotated dataset, the model still matches standard parsers, which indicates that it is highly data-efficient, unlike sequence-to-sequence models without attention. It is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.

Abstract

Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain-specific, complex, and inefficient. In this paper we show that a domain-agnostic, attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset when trained on a large synthetic corpus that was annotated using existing parsers. It also matches the performance of standard parsers when trained only on a small human-annotated dataset, which shows that this model is highly data-efficient, in contrast to sequence-to-sequence models without the attention mechanism. Our parser is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.
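
For a sequence-to-sequence model to act as a parser, the parse tree has to be expressed as a flat sequence of output tokens. The abstract does not spell out the encoding, so the following is only a minimal sketch under stated assumptions: trees are nested `(label, children)` tuples whose leaves are part-of-speech tags, and the helper names `linearize` and `delinearize` are hypothetical, chosen here for illustration. It shows a depth-first traversal that emits labelled opening and closing brackets, plus the inverse that rebuilds the tree from a predicted sequence.

```python
from typing import List, Tuple, Union

# Assumed representation: a leaf is a POS-tag string,
# an internal node is (label, [children]).
Tree = Union[str, Tuple[str, list]]


def linearize(tree: Tree) -> List[str]:
    """Flatten a tree depth-first into bracket tokens such as '(NP' and ')NP'."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens


def delinearize(tokens: List[str]) -> Tree:
    """Rebuild the tree; raise ValueError if the brackets are unbalanced."""
    stack = [("ROOT", [])]  # sentinel frame that collects the top-level tree
    for tok in tokens:
        if tok.startswith("(") and len(tok) > 1:
            stack.append((tok[1:], []))
        elif tok.startswith(")") and len(tok) > 1:
            if len(stack) < 2:
                raise ValueError("closing bracket without a matching opener")
            label, children = stack.pop()
            if label != tok[1:]:
                raise ValueError("mismatched bracket label: " + tok)
            stack[-1][1].append((label, children))
        else:
            stack[-1][1].append(tok)  # leaf token (POS tag)
    if len(stack) != 1 or len(stack[0][1]) != 1:
        raise ValueError("sequence does not encode exactly one tree")
    return stack[0][1][0]


# Illustrative tree for a sentence shaped like "John has a dog ."
example = ("S", [("NP", ["NNP"]), ("VP", ["VBZ", ("NP", ["DT", "NN"])]), "."])
sequence = linearize(example)
```

Here `" ".join(sequence)` gives `(S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S`, which a model could be trained to emit token by token given the input sentence. Because a learned decoder can produce unbalanced output, the inverse function rejects malformed sequences rather than guessing a repair.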
