LREC 2026 (main conference)

A Comprehensive Full-Form Lexicon for Arabic NLP and Speech Technology

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2gbvmmu4ix5e

Abstract

Natural Language Processing (NLP) applications require morphological data with precise grammatical attributes, while speech technology requires abundant phonemic and phonetic data. This is a challenge for Arabic because of the extensive morphological, orthographic, and phonemic ambiguity in both Modern Standard Arabic (MSA) and the various Arabic dialects. Existing systems struggle with incomplete and unstructured web data, leading to suboptimal performance in both morphological analysis and speech applications. This paper presents ArabLEX, a full-form lexicon, that is, one that includes all wordforms (the fully inflected and cliticized members of a lexeme class). ArabLEX addresses these issues by providing a large-scale database designed to enhance NLP accuracy. It comprises approximately 570 million entries with fully inflected forms and detailed morphological, phonetic, and orthographic attributes. ArabLEX serves as a foundational framework for developing comprehensive Arabic lexical resources for NLP, particularly for speech technology, as well as for dialect databases.

Details

Paper ID
lrec2026-main-108
Pages
pp. 1382-1393
BibKey
haralambous-etal-2026-comprehensive
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 – 16 May 2026

Authors

  • Yannis Haralambous

  • Jack Halpern

Links