BLEU deconstructed: Designing a Better MT Evaluation Metric

Abstract

BLEU is the de facto standard automatic evaluation metric in machine translation. While BLEU is undeniably useful, it has a number of limitations. Although it works well for large documents and multiple references, it is unreliable at the sentence or sub-sentence levels, and with a single reference. In this paper, we propose new variants of BLEU which address these limitations, resulting in a more flexible metric which is not only more reliable, but also allows for more accurate discriminative training. Our best metric has better correlation with human judgements than standard BLEU, despite using a simpler formulation. Moreover, these improvements carry over to a system tuned for our new metric.
