Back to Main Conference 2026
LREC 2026main

A Large and Balanced Multi-Domain Arabic Corpus Annotated for Morphology, Syntax, and Readability

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/45f2o6t8piyi

Abstract

We present BAREC-10M, an expanded version of the Balanced Arabic Readability Evaluation Corpus (BAREC). This new release extends the original 1M-word corpus to 10 million words and broadens its scope to include balanced multi-domain coverage annotated for morphology, syntax, and readability. The corpus integrates 45 sub-corpora drawn from diverse sources, including news, educational materials, literature, children’s texts, and religious discourse. Each text is labeled for domain, readership level, and genre, and automatically analyzed using state-of-the-art morphological and syntactic tools. To enhance coverage of underrepresented varieties, we manually digitized and included children’s materials, magazines, and curriculum-based content. The resulting dataset provides a balanced resource for studying Arabic linguistic variation across styles, audiences, and levels of complexity.

Details

Paper ID
lrec2026-main-921
Pages
pp. 11761-11775
BibKey
elmadani-etal-2026-large
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 16 May 2026

Authors

  • KE

    Khalid N. Elmadani

  • AW

    Adel Mahmoud Wizani

  • HT

    Hanada Taha Thomure

  • NH

    Nizar Habash

Links