LREC-COLING 2024, main conference

Arabic Diacritization Using Morphologically Informed Character-Level Model

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5pgzxaektewi

Abstract

Arabic diacritic recovery, i.e., diacritization, is necessary for proper vocalization and is an enabler for downstream applications such as language learning and text-to-speech. Diacritics come in two varieties: core-word diacritics and case endings. In this paper, we introduce a highly effective morphologically informed character-level model that can recover both types of diacritics simultaneously. The model uses a Recurrent Neural Network (RNN) based architecture that takes in text as a sequence of characters, with markers for morphological segmentation, and outputs a sequence of diacritics. We also introduce a character-based morphological segmentation model that we train for Modern Standard Arabic (MSA) and dialectal Arabic. We demonstrate the efficacy of our diacritization model on Classical Arabic, MSA, and two dialects (Moroccan and Tunisian). We achieve the lowest reported word-level diacritization error rate for MSA (3.4%), match the best results for Classical Arabic (5.4%), and report competitive results for dialectal Arabic.
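To make the task framing in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows one plausible way to turn a diacritized word into the kind of (character, diacritic label) pairs a character-level model would consume and predict, and one plausible way to compute a word-level diacritization error rate. The function names `split_diacritics` and `word_error_rate` are hypothetical, the diacritic inventory is assumed to be the Unicode range U+064B to U+0652, and the paper's actual scoring rules (for example, how undiacritized or non-Arabic tokens are counted) may differ.

```python
# Assumed diacritic inventory: Unicode code points U+064B..U+0652
# (the three tanween marks, fatha, damma, kasra, shadda, sukun).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_diacritics(word):
    """Split a diacritized word into base characters and per-character labels.

    Returns (chars, labels), where labels[i] holds the diacritics attached
    to chars[i] ("" if none). Any non-diacritic symbol, such as a
    hypothetical "+" segmentation marker, is kept as a base character
    with an empty label.
    """
    chars, labels = [], []
    for ch in word:
        if ch in DIACRITICS:
            if chars:  # ignore a stray diacritic with no base character
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    # Sort each label so shadda+vowel compares equal regardless of typing order.
    return chars, ["".join(sorted(label)) for label in labels]


def word_error_rate(reference, hypothesis):
    """Fraction of whitespace-separated words whose diacritization differs.

    A word counts as an error if any of its characters carries a different
    diacritic label (or the base characters themselves differ).
    """
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same, non-zero word count")
    errors = sum(
        split_diacritics(r) != split_diacritics(h)
        for r, h in zip(ref_words, hyp_words)
    )
    return errors / len(ref_words)
```

Under this sketch, a two-word reference whose second word receives the wrong case ending in the hypothesis yields an error rate of 0.5, since one of two words differs. The reported figures in the abstract (3.4% for MSA, 5.4% for Classical Arabic) are word-level rates of this general kind, computed under the paper's own evaluation setup.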

Details

Paper ID
lrec2024-main-0128
Pages
pp. 1446-1454
BibKey
elmallah-etal-2024-arabic
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20-25 May 2024

Authors

  • Muhammad Morsy Elmallah
  • Mahmoud Reda
  • Kareem Darwish
  • Abdelrahman El-Sheikh
  • Ashraf Hatim Elneima
  • Murtadha Aljubran
  • Nouf Alsaeed
  • Reem Mohammed
  • Mohamed Al-Badrashiny
