Automatic evaluation of machine translation quality using n-gram co-occurrence statistics

TLDR

Evaluation is a valuable tool in Human Language Technology R&D, but its use in machine translation (MT) research has been limited because it relies on costly, time-consuming human judgments. At the July 2001 TIDES PI meeting, IBM described an automatic MT evaluation technique that offers immediate feedback and guidance for MT research. The technique, called an "evaluation understudy," compares MT output with expert reference translations by counting shared short word N-grams; the more N-grams a translation shares with the references, the better it is judged to be. IBM showed that these automatic scores correlate strongly with human judgments, which led DARPA to commission NIST to develop an MT evaluation facility based on the IBM work. That utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.

Abstract

Evaluation is recognized as an extremely helpful forcing function in Human Language Technology R&D. Unfortunately, evaluation has not been a very powerful tool in machine translation (MT) research because it requires human judgments and is thus expensive and time-consuming and not easily factored into the MT research agenda. However, at the July 2001 TIDES PI meeting in Philadelphia, IBM described an automatic MT evaluation technique that can provide immediate feedback and guidance in MT research. Their idea, which they call an "evaluation understudy", compares MT output with expert reference translations in terms of the statistics of short sequences of words (word N-grams). The more of these N-grams that a translation shares with the reference translations, the better the translation is judged to be. The idea is elegant in its simplicity. But far more important, IBM showed a strong correlation between these automatically generated scores and human judgments of translation quality. As a result, DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work. This utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
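
The core idea of counting word N-grams shared between MT output and reference translations can be illustrated with a minimal sketch. This is not the IBM or NIST scoring formula, which the abstract does not spell out; it only computes the fraction of a candidate's N-grams that also occur in at least one reference. The function names, the lowercased whitespace tokenization, and the choice to cap each N-gram's credit at its highest count in any single reference are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def shared_ngram_fraction(candidate, references, n):
    """Fraction of candidate n-grams that also appear in the references.

    Assumption (not from the source): tokens are lowercased and split on
    whitespace, and an n-gram is credited at most as many times as it
    occurs in the single reference that contains it most often.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    # Highest count of each n-gram across the individual references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    shared = sum(min(count, max_ref_counts[gram])
                 for gram, count in cand_counts.items())
    return shared / sum(cand_counts.values())


if __name__ == "__main__":
    refs = ["the cat is on the mat", "there is a cat on the mat"]
    hyp = "the cat sat on the mat"
    for n in (1, 2):
        print(n, round(shared_ngram_fraction(hyp, refs, n), 3))
```

In this toy example the hypothesis shares five of its six unigrams and three of its five bigrams with the references, so a translation closer to the references would score higher at both N-gram lengths. A real evaluation facility would also have to decide how to combine scores across N-gram lengths and how to treat translation length, which this sketch deliberately leaves out.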