Publication | Open Access
Overcoming the Lack of Parallel Data in Sentence Compression
Citations: 139
References: 30
Year: 2013
Venue: Unknown
Structured Prediction, Syntactic Parsing, Engineering, Machine Learning, Multilingual Pretraining, Corpus Linguistics, Text Mining, Natural Language Processing, Data Science, Computational Linguistics, Supervised Sentence Compression, Compression Corpus, Grammar, Language Studies, Machine Translation, Sentence Compression, Sequence Modelling, NLP Task, New System, Computer Science, Deep Learning, Shallow Parsing, Linguistics
A major challenge in supervised sentence compression is making use of rich feature representations because of very scarce parallel data. We address this problem and present a method to automatically build a compression corpus with hundreds of thousands of instances on which deletion-based algorithms can be trained. In our corpus, the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence supervised systems which require a structural alignment between the input and output can be successfully trained. We also extend an existing unsupervised compression method with a learning module. The new system uses structured prediction to learn from lexical, syntactic and other features. An evaluation with human raters shows that the presented data harvesting method indeed produces a parallel corpus of high quality. Also, the supervised system trained on this corpus gets high scores both from human raters and in an automatic evaluation setting, significantly outperforming a strong baseline.
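The subtree property mentioned in the abstract can be illustrated with a small sketch. This is not the authors' harvesting pipeline. It only shows what it could mean for a deletion-based compression to be a connected subtree of the source sentence's dependency tree. The head-index encoding, the example sentence, and the function names `is_connected_subtree` and `compress` are assumptions made for illustration.

```python
from typing import List, Set


def is_connected_subtree(heads: List[int], kept: Set[int]) -> bool:
    """Return True if the kept tokens form one connected subtree.

    heads[i] is the index of token i's head, or -1 for the sentence root
    (an assumed toy encoding of a dependency tree).
    """
    if not kept:
        return False
    # A kept token whose head was deleted (or which is the sentence root)
    # starts a separate component; a connected subtree has exactly one.
    local_roots = [i for i in kept if heads[i] not in kept]
    return len(local_roots) == 1


def compress(tokens: List[str], kept: Set[int]) -> str:
    """Deletion-based compression: keep the selected tokens in original order."""
    return " ".join(tok for i, tok in enumerate(tokens) if i in kept)


# Hypothetical example sentence with a hand-written dependency tree.
tokens = ["The", "senator", "from", "Ohio", "resigned", "yesterday"]
heads = [1, 4, 1, 2, -1, 4]

good = {0, 1, 4}     # "The senator resigned"
bad = {0, 1, 3, 4}   # keeps "Ohio" but deletes its head "from"

print(compress(tokens, good), is_connected_subtree(heads, good))
print(compress(tokens, bad), is_connected_subtree(heads, bad))
```

A pair that passes this kind of check gives a supervised system a direct structural alignment between input and output, since every kept word retains its original head.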