LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task

Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

DOI:10.63317/2sqhrzmrh67i

Abstract

This paper presents our approach to the Dialect to Modern Standard Arabic (MSA) Machine Translation shared task, conducted as part of the sixth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6). Our primary contribution is the development of a novel dataset derived from The Saudi Audio Dataset for Arabic (SADA) an Arabic audio corpus. By employing an automated method utilizing ChatGPT 3.5, we translated the dialectal Arabic texts to their MSA equivalents. This process not only yielded a unique and valuable dataset but also showcased an efficient method for leveraging language models in dataset generation. Utilizing this dataset, alongside additional resources, we trained a machine translation model based on the Transformer architecture. Through systematic experimentation with model configurations, we achieved notable improvements in translation quality. Our findings highlight the significance of LLM-assisted dataset creation methodologies and their impact on advancing machine translation systems, particularly for languages with considerable dialectal diversity like Arabic.

Resources

Details

Paper ID

lrec2024-ws-osact-14

Pages

pp. 112-116

DOI

10.63317/2sqhrzmrh67i

BibKey

abdelaziz-etal-2024-llm

Editors

Hend Al-Khalifa, Kareem Darwish, Hamdy Mubarak, Mona Ali, Tamer Elsayed

Publisher

European Language Resources Association (ELRA) and ICCL

ISSN

N/A

ISBN

N/A

Workshop

Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

Location

Turin, Italy

Date

20 - 25 May 2024

Authors

AA
AhmedElmogtaba Abdelmoniem Ali Abdelaziz
AE
Ashraf Hatim Elneima
KD
Kareem Darwish

Links

URL

DOI