ADAB: Arabic Dataset for Automated Politeness Benchmarking - a Large-Scale Resource for Computational Sociopragmatics
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
The growing importance of culturally aware natural language processing systems has increased the demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain scarce, despite the rich and complex politeness expressions embedded in Arabic communication. This paper introduces ADAB/أدب (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms spanning social media, e-commerce, and customer service domains, and covering both Modern Standard Arabic (MSA) and multiple dialectal varieties (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated following a process guided by Arabic linguistic traditions and contemporary pragmatic theory, yielding a three-way politeness classification: polite, impolite, and neutral. It contains 10,000 samples with detailed linguistic feature annotations across 16 politeness categories, and achieves substantial inter-annotator agreement (κ = 0.703). The dataset was benchmarked with 40 model configurations spanning traditional machine learning (12 models), transformer-based architectures (10 models), and large language models (18 configurations), demonstrating both its practical utility and the challenges it poses. This resource aims to bridge the gap in Arabic sociopragmatic NLP and to encourage further research into politeness-aware applications for the Arabic language.
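For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how a kappa value can be computed for a three-way politeness labeling. The abstract does not state which kappa variant was used or how many annotators took part, so the two-annotator Cohen's kappa shown here is an assumption. The function name `cohen_kappa` and the toy labels are hypothetical and are not taken from ADAB.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: two annotators and nominal labels. The source only reports
    a kappa value and does not specify the variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates,
    # summed over all labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy, made-up annotations (not ADAB data).
annotator_1 = ["polite", "polite", "polite", "polite", "impolite",
               "impolite", "impolite", "neutral", "neutral", "neutral"]
annotator_2 = ["polite", "polite", "polite", "neutral", "impolite",
               "impolite", "polite", "neutral", "neutral", "impolite"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Under the commonly cited Landis and Koch scale, values between 0.61 and 0.80 are described as substantial agreement, which is the band that the reported κ = 0.703 falls into.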