Back to Main Conference 2024
LREC-COLING 2024main

Analysis on Unsupervised Acquisition Process of Bilingual Vocabulary through Iterative Back-Translation

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5dejb9okao2y

Abstract

In this paper, we investigate how new bilingual vocabulary is acquired through Iterative Back-Translation (IBT), which is known as a data augmentation method for machine translation from monolingual data of both source and target languages. To reveal the acquisition process, we first identify the word translation pairs in test data that do not exist in a bilingual data but do only in two monolingual data, then observe how many pairs are successfully translated by the translation model trained through IBT. We experimented on it with domain adaptation settings on two language pairs. Our experimental evaluation showed that more than 60% of the new bilingual vocabulary is successfully acquired through IBT along with the improvement in the translation quality in terms of BLEU. It also revealed that new bilingual vocabulary was gradually acquired by repeating IBT iterations. From the results, we present our hypothesis on the process of new bilingual vocabulary acquisition where the context of the words plays a critical role in the success of the acquisition.

Details

Paper ID
lrec2024-main-0080
Pages
pp. 887-892
BibKey
tanigawa-etal-2024-analysis
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • TT

    Takuma Tanigawa

  • TA

    Tomoyosi Akiba

  • HT

    Hajime Tsukada

Links