LREC-COLING 2024 (main)

Multilinguality or Back-translation? A Case Study with Estonian

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/3ixb5a5ypn4e

Abstract

Machine translation quality is highly reliant on large amounts of training data, and, when a limited amount of parallel data is available, synthetic back-translated or multilingual data can be used in addition. In this work, we introduce SynEst, a synthetic corpus of translations from 11 languages into Estonian which totals over 1 billion sentence pairs. Using this corpus, we investigate whether adding synthetic or English-centric additional data yields better translation quality for translation directions that do not include English. Our results show that while both strategies are effective, synthetic data gives better results. Our final models improve the performance of the baseline No Language Left Behind model while retaining its source-side multilinguality.

Details

Paper ID
lrec2024-main-1033
Pages
pp. 11838-11848
BibKey
korotkova-etal-2024-multilinguality
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 – 25 May 2024

Authors

  • Elizaveta Korotkova

  • Taido Purason

  • Agnes Luhtaru

  • Mark Fishel

Links