Automatic Ranking of MT Systems

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

Abstract

In earlier work, we succeeded in automatically predicting the relative rankings of MT systems derived from human judgments on the Fluency, Adequacy or Informativeness of their output. That work yielded two promising automatically computable predictors: the D-score, based on semantic features of the MT output, and the X-score, based on syntactic features. In this paper, we present an experiment, using human evaluators and additional data, designed to test the robustness of those earlier results. We conclude that the X-score is indeed a robust and reliable predictor, even on new data for which it has not been specifically tuned.