Corruption-Based Data Augmentation for Arabic Essay Scoring: A Preliminary Study on the Organization Trait

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

Despite significant advances in Automated Essay Scoring (AES), progress in Arabic AES remains limited by the scarcity and imbalance of publicly available datasets. Manual curation of such data is labor-intensive and does not scale. To address this, we introduce COrE, a corruption-based data augmentation method that targets the organization trait of Arabic essays. COrE generates synthetic essays by intentionally disrupting the organization of well-written essays through controlled, distance-aware sentence swapping. We conduct our experiments on TAQAE, a dataset of 620 essays across four distinct writing prompts, and evaluate COrE with two widely adopted pre-trained models: AraBERTv2 and CAMeLBERT-mix. Both models improve with COrE, achieving gains of 9-17% over the no-augmentation baseline. These results highlight the potential of trait-specific augmentation to address data scarcity and enhance AES performance for low-resource languages.
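
To make the idea of distance-aware sentence swapping concrete, the following is a minimal illustrative sketch, not the COrE implementation. The abstract does not specify how swap pairs are chosen or how labels are assigned to corrupted essays, so the function name `corrupt_organization` and the parameters `num_swaps`, `min_distance`, and `seed` are hypothetical. The sketch assumes an essay has already been split into sentences and that a swap between positions farther apart is treated as a stronger disruption.

```python
import random


def corrupt_organization(sentences, num_swaps=1, min_distance=2, seed=0):
    """Return a copy of `sentences` with up to `num_swaps` pairs swapped.

    Hypothetical sketch: every swapped pair is at least `min_distance`
    positions apart, so a larger `min_distance` moves sentences farther
    from their original place in the essay.
    """
    rng = random.Random(seed)
    corrupted = list(sentences)  # leave the original essay untouched
    n = len(corrupted)

    # All index pairs (i, j) whose distance meets the threshold.
    candidates = [(i, j) for i in range(n) for j in range(i + min_distance, n)]
    if not candidates:
        return corrupted  # essay too short to corrupt at this distance

    # Sample distinct pairs and apply the swaps in sequence.
    k = min(num_swaps, len(candidates))
    for i, j in rng.sample(candidates, k=k):
        corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return corrupted


essay = ["s0", "s1", "s2", "s3", "s4"]
synthetic = corrupt_organization(essay, num_swaps=1, min_distance=3, seed=1)
```

With `num_swaps=1`, the synthetic essay contains the same sentences as the original, with exactly two of them exchanged across a gap of at least `min_distance` positions. How such a corrupted essay would be scored for training is left open here, since the abstract does not describe it.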