A Study of Translation Edit Rate with Targeted Human Annotation

TLDR

The study examines Translation Edit Rate (TER), an intuitive metric for evaluating machine translation output that avoids both the knowledge intensiveness of meaning-based approaches and the labor intensiveness of human judgments. TER measures the amount of editing a human would have to perform to change a system output so that it exactly matches a reference translation. Single-reference TER correlates with human judgments of MT quality as well as four-reference BLEU. The study also defines a human-targeted variant, HTER, which correlates with human judgments better than BLEU (even when BLEU is given human-targeted references) and better than HMETEOR. The four-reference variants of TER and HTER correlate with human judgments as well as, or better than, a second human judgment does.

Abstract

We examine a new, intuitive measure for evaluating machine-translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We show that the single-reference variant of TER correlates as well with human judgments of MT quality as the four-reference variant of BLEU. We also define a human-targeted TER (or HTER) and show that it yields higher correlations with human judgments than BLEU—even when BLEU is given human-targeted references. Our results indicate that HTER correlates with human judgments better than HMETEOR and that the four-reference variants of TER and HTER correlate with human judgments as well as—or better than—a second human judgment does.
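
To make the edit-rate idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization and case-sensitive matching, and it counts only word insertions, deletions, and substitutions. The full TER metric also counts a shift of a contiguous word sequence as a single edit, which this sketch omits, so it can only overestimate the true score. The function names `word_edit_distance` and `ter_no_shifts` are hypothetical. The score is the number of edits to the closest reference divided by the average reference length.

```python
from typing import List, Sequence


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn hyp into ref (standard Levenshtein distance on words)."""
    # prev[j] holds the distance between the hyp prefix seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # turning i hypothesis words into an empty reference
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete h from the hypothesis
                curr[j - 1] + 1,     # insert r into the hypothesis
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def ter_no_shifts(hypothesis: str, references: List[str]) -> float:
    """Simplified edit rate: edits to the closest reference divided by
    the average reference length. Assumption: block shifts are NOT modeled."""
    if not references:
        raise ValueError("at least one reference is required")
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    avg_ref_len = sum(len(r) for r in refs) / len(refs)
    if avg_ref_len == 0:
        raise ValueError("references must contain at least one word")
    min_edits = min(word_edit_distance(hyp, r) for r in refs)
    return min_edits / avg_ref_len


if __name__ == "__main__":
    # One missing word against a six-word reference: 1 / 6.
    print(ter_no_shifts("the cat sat on mat", ["the cat sat on the mat"]))
```

In this example the hypothesis needs one insertion to match a six-word reference, so the score is 1/6, roughly 0.167. Under the same assumptions, an HTER-style score would reuse this computation with a human-targeted reference in place of an independently written one.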
