
How Does Automatic Machine Translation Evaluation Correlate with Human Scoring as the Number of Reference Translations Increases?

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/2agzmwdvyn96

Abstract

Automatic machine translation evaluation is a very difficult task owing to the wide diversity of valid translations that may result from a single source sentence or textual segment. A number of competing methods of automatic machine translation evaluation have recently been adopted by the research community; among the most widely used are BLEU, NIST, mWER and the F-measure. This work extends previous studies of how closely these evaluation techniques match human performance at ranking translation output, focusing on how the methods scale with increasing numbers of human-produced reference translations. We measure the correlation of the automatic rankings of the output of nine machine translation systems with the ranking derived from scores assigned by nine human evaluators, using up to sixteen references per sentence. Our results show that evaluation performance improves with increasing numbers of references for all of the scoring methods except NIST, which improves only with small numbers of references.
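The evaluation setup described above can be sketched as a rank correlation between a metric's system ranking and the human ranking. The sketch below uses Spearman's rho over hypothetical scores (all numbers are illustrative, not from the paper) to show how one might compare a metric computed with one reference against the same metric computed with sixteen.

```python
# Minimal sketch: Spearman rank correlation between an automatic
# metric's ranking of MT systems and a human ranking.
# All score values below are hypothetical, for illustration only.

def rank(scores):
    # Map each system index to its rank (1 = best).
    # Ties are not handled, which is fine for this sketch.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    # Spearman's rho via the rank-difference formula (no ties):
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    n = len(x)
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for nine MT systems.
human      = [4.1, 3.8, 3.9, 2.5, 3.1, 2.9, 3.6, 2.2, 3.3]
bleu_1ref  = [0.30, 0.26, 0.29, 0.18, 0.21, 0.22, 0.24, 0.15, 0.23]
bleu_16ref = [0.52, 0.47, 0.49, 0.31, 0.38, 0.35, 0.44, 0.27, 0.41]

print(f"rho (1 ref):  {spearman(bleu_1ref,  human):.3f}")   # 0.983
print(f"rho (16 ref): {spearman(bleu_16ref, human):.3f}")   # 1.000
```

In the paper's setting this comparison would be repeated for each metric (BLEU, NIST, mWER, F-measure) and each reference-set size from one to sixteen.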

Details

Paper ID
lrec2004-main-147
Pages
N/A
BibKey
finch-etal-2004-automatic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26–28 May 2004

Authors

  • Andrew Finch
  • Yasuhiro Akiba
  • Eiichiro Sumita
