LREC-COLING 2024 (main)

Learning from Wrong Predictions in Low-Resource Neural Machine Translation

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/4jpn3p22cenc

Abstract

Resource scarcity in Neural Machine Translation is a challenging problem both in industry applications and in the support of less-spoken languages, represented in the worst case by endangered and low-resource languages. Many Data Augmentation methods rely on additional linguistic sources and software tools, but these are often not available for less favoured languages. For this reason, we present USKI (Unaligned Sentences Keytokens pre-traIning), a pre-training strategy that leverages the relationships and similarities that exist between unaligned sentences. By doing so, we increase the dataset size of endangered and low-resource languages to the square of its initial size, matching the typical size of high-resource language datasets such as WMT14 En-Fr. Results showcase the effectiveness of our approach, with an average increase of 0.9 BLEU across the benchmarks using only a small fraction of the entire unaligned corpus, suggesting the importance of the research topic and the potential of a currently under-utilized resource and under-explored approach.
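
The quadratic growth mentioned in the abstract can be illustrated with a minimal counting sketch. It assumes that unaligned examples are obtained by pairing every source sentence with every target sentence of a parallel corpus; the abstract does not spell out the construction, so this is an illustration of the arithmetic only, not the authors' implementation. The function name `all_pairs` and the toy sentences are hypothetical.

```python
from itertools import product


def all_pairs(sources, targets):
    """Return every (source, target) combination with an 'aligned' flag.

    Assumption: sources[i] and targets[i] are translations of each other,
    so a pair is aligned only when both indices match.
    """
    assert len(sources) == len(targets)
    return [
        (src, tgt, i == j)
        for (i, src), (j, tgt) in product(enumerate(sources), enumerate(targets))
    ]


# Toy parallel corpus with N = 3 sentence pairs (hypothetical data).
sources = ["a b", "c d", "e f"]
targets = ["A B", "C D", "E F"]

pairs = all_pairs(sources, targets)
n_aligned = sum(1 for _, _, aligned in pairs if aligned)

# N * N combinations in total: N aligned, N * N - N unaligned.
print(len(pairs), n_aligned, len(pairs) - n_aligned)  # prints: 9 3 6
```

Under this assumption, a corpus of N aligned pairs yields N² combinations, of which N² - N are unaligned. That would explain how a small corpus could reach the scale of a high-resource dataset.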

Details

Paper ID
lrec2024-main-0896
Pages
pp. 10263-10273
BibKey
hu-etal-2024-learning
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20-25 May 2024

Authors

  • Jia Cheng Hu

  • Roberto Cavicchioli

  • Giulia Berardinelli

  • Alessandro Capotondi

Links