Simple Semi-supervised Dependency Parsing

TLDR

The study proposes a simple, effective semi‑supervised method for training dependency parsers. The approach uses lexical representation features that incorporate word clusters derived from a large unannotated corpus. Experiments on the Penn Treebank and Prague Dependency Treebank show that cluster‑based features yield substantial gains, improving English unlabeled second‑order parsing from 92.02 % to 93.16 % and Czech from 86.13 % to 87.13 %, and reducing the amount of supervised data needed by roughly half.

Abstract

We present a simple and effective semisupervised method for training dependency parsers. We focus on the problem of lexical representation, introducing features that incorporate word clusters derived from a large unannotated corpus. We demonstrate the effectiveness of the approach in a series of dependency parsing experiments on the Penn Treebank and Prague Dependency Treebank, and we show that the cluster-based features yield substantial gains in performance across a wide range of conditions. For example, in the case of English unlabeled second-order parsing, we improve from a baseline accuracy of 92.02% to 93.16%, and in the case of Czech unlabeled second-order parsing, we improve from a baseline accuracy of 86.13% to 87.13%. In addition, we demonstrate that our method also improves performance when small amounts of training data are available, and can roughly halve the amount of supervised data required to reach a desired level of performance.

References

Page 1

	Year	Citations

Page 1