DIA2 - a Comprehensive and Diverse Diacritized Arabic Corpus for NLP Research

The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7) with 5 Shared Tasks

DOI:10.63317/3k2m7vtzunuk

Abstract

The development of Arabic natural language processing (NLP) applications and large language models (LLMs) faces substantial challenges, primarily due to the scarcity of high-quality native Arabic datasets. To address this critical gap, we present DIA2 (a Comprehensive and Diverse Diacritized Modern Standard Arabic Corpus), a novel dataset curated from 28 diverse, carefully selected Arabic sources. DIA2 emphasizes the use of original Arabic text and explicitly avoids machine-translated content. The corpus incorporates substantial amounts of text from books, news articles, and poetry, and employs extensive data preprocessing to support NLP research and LLM development. Our preprocessing pipeline includes rigorous text cleaning, URL- and document-level deduplication, and automatic diacritization, while preserving a gold diacritized subset derived from manually annotated sources. The resulting corpus comprises over 140 GB of high-quality text, containing more than 26 million unique words and 41.9 billion tokens. To evaluate the proposed pipeline, we conducted controlled continued pretraining experiments using Llama3.1-8B on both raw and processed subsets of DIA2. The model trained on processed data consistently outperformed its counterpart across multiple Arabic evaluation benchmarks. These results highlight the positive impact of systematic preprocessing and the utility of DIA2 in empowering native Arabic LLMs and downstream NLP tasks.
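The abstract names URL- and document-level deduplication as pipeline steps but does not describe how they are implemented. The following is only a minimal sketch of what such steps could look like, not the authors' code. The record layout (`url` and `text` keys), the function names, the specific cleaning rules, and the choice to hash diacritic-stripped text (so that diacritized and undiacritized copies of the same document count as duplicates) are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata

# Arabic diacritic marks (tashkeel): U+064B..U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, carries no lexical content


def normalize(text: str) -> str:
    """Hypothetical cleaning step: NFC-normalize, drop tatweel, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


def content_key(text: str) -> str:
    """Hash the cleaned text with diacritics removed (an assumption, see lead-in)."""
    bare = DIACRITICS.sub("", normalize(text))
    return hashlib.sha256(bare.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record per URL, then the first record per content hash."""
    seen_urls, seen_hashes, kept = set(), set(), []
    for rec in records:
        # URL-level pass: treat a trailing slash as insignificant (simplification).
        url = rec["url"].rstrip("/")
        if url in seen_urls:
            continue
        seen_urls.add(url)
        # Document-level pass: exact match on the diacritic-insensitive hash.
        key = content_key(rec["text"])
        if key in seen_hashes:
            continue
        seen_hashes.add(key)
        kept.append({**rec, "text": normalize(rec["text"])})
    return kept
```

Exact hashing only catches identical documents after cleaning. Near-duplicate detection (for example MinHash over shingles) would be a separate design choice that this sketch does not attempt.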

Details

Paper ID
lrec2026-ws-osact-14
Pages
pp. 115-130
BibKey
dekmak-etal-2026-dia2
Editors
Hend Al-Khalifa, Mo El-Haj, Saad Ezzini
Publisher
European Language Resources Association (ELRA)
Workshop
The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7) with 5 Shared Tasks
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Fatima Dekmak
  • Shady Elbassuoni
  • Khaled Shaban
  • Hazem Hajj
  • Wassim El-Hajj
  • Yasmine Abu Adla
  • Buthaina Alabrash
