
Machine Translation Evaluation: N-grams to the Rescue

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

DOI:10.63317/2fjqfypf5jsh

Abstract

Human judges weigh many subtle aspects of translation quality. But human evaluations are very expensive. Developers of Machine Translation systems need to evaluate quality constantly. Automatic methods that approximate human judgment are therefore very useful. The main difficulty in automatic evaluation is that there are many correct translations that differ in choice and order of words. There is no single gold standard to compare a translation with. The closer a machine translation is to professional human translations, the better it is. We borrow precision and recall concepts from Information Retrieval to measure closeness. The precision measure is used on variable-length n-grams. Unigram matches between machine translation and the professional reference translations account for adequacy. Longer n-gram matches account for fluency. The n-gram precisions are aggregated across sentences and averaged. A multiplicative brevity penalty prevents cheating. The resulting metric correlates highly with human judgments of translation quality. This method is tested for robustness across language families and across the spectrum of translation quality. We discuss BLEU, an automatic method to evaluate translation quality that is cheap, fast, and good.
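The abstract's recipe (clipped n-gram precision, a geometric mean over n-gram orders, and a multiplicative brevity penalty) can be sketched in a few lines of Python. This is a simplified illustration assuming a single reference translation per candidate; the function names and the single-sentence scope are this sketch's own choices, not the paper's exact formulation.

```python
# Minimal sketch of a BLEU-style score: geometric mean of clipped n-gram
# precisions times a brevity penalty. Single reference, no smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a matched word cannot inflate precision.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if total == 0 or clipped == 0:
            return 0.0  # no overlap at this n-gram order
        log_precisions.append(math.log(clipped / total))
    # Multiplicative brevity penalty: the "cheating" guard against
    # very short candidates that would otherwise score high precision.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, and a candidate that repeats one matched word (e.g. "the the the") is held to 0 by clipping and the missing higher-order matches, which is exactly why unigram precision alone would be too easy to game.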

Details

Paper ID
lrec2002-main-347
Pages
N/A
BibKey
papineni-2002-machine
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Third International Conference on Language Resources and Evaluation
Location
Las Palmas, Spain
Date
29–31 May 2002

Authors

  • Kishore Papineni
