Back to Main Conference 2014
LREC 2014main

Quality Estimation for Synthetic Parallel Data Generation

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/5aoqec3ux3qu

Abstract

This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an English―Croatian version of the Europarl parallel corpus based on the English―Slovene Europarl corpus and the Apertium rule-based translation system for Slovene―Croatian. These experiments are to be considered as a first step towards the generation of reliable synthetic parallel data for under-resourced languages. We first collect small amounts of aligned parallel data for the Slovene―Croatian language pair in order to build a quality estimation system for sentence-level Translation Edit Rate (TER) estimation. We then infer TER scores on automatically translated Slovene to Croatian sentences and use the best translations to build an English―Croatian statistical MT system. We show significant improvement in terms of automatic metrics obtained on two test sets using our approach compared to a random selection of synthetic parallel data.

Details

Paper ID
lrec2014-main-625
Pages
pp. 1843-1849
BibKey
rubino-etal-2014-quality
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 31 May 2014

Authors

  • RR

    Raphael Rubino

  • AT

    Antonio Toral

  • NL

    Nikola Ljubešić

  • GR

    Gema Ramírez-Sánchez

Links