
MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/2av2mm9gpjb9

Abstract

We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains annotations for eleven language pairs, including both high- and low-resource languages. Specifically, it is annotated for translation quality with human labels for up to 10,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level binary good/bad labels. Apart from the quality-related scores, each source-translation sentence pair is accompanied by the corresponding post-edited sentence, as well as the titles of the articles from which the sentences were extracted, and information on the neural MT models used to translate the text. We provide a thorough description of the data collection and annotation process as well as an analysis of the annotation distribution for each language pair. We also report the performance of baseline systems trained on the MLQE-PE dataset. The dataset is freely available and has already been used for several WMT shared tasks.

Details

Paper ID
lrec2022-main-530
Pages
pp. 4963-4974
BibKey
fomicheva-etal-2022-mlqe
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 to 25 June 2022

Authors

  • Marina Fomicheva
  • Shuo Sun
  • Erick Fonseca
  • Chrysoula Zerva
  • Frédéric Blain
  • Vishrav Chaudhary
  • Francisco Guzmán
  • Nina Lopatina
  • Lucia Specia
  • André F. T. Martins
