MaskedVerbalizer: Automatic Verbalizer Construction for Few-Shot Text Classification in Low-Resource Right-to-Left Languages
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Text classification in low-resource right-to-left languages faces significant challenges due to the scarcity of annotated data and the morphological richness of languages such as Arabic, Urdu, Sindhi, and Pashto. Arabic and Urdu alone are spoken by more than 380 million and 246 million people worldwide, respectively, and Pashto is the national language of Afghanistan, underscoring the need for effective language technologies for these languages. While multilingual Pre-trained Language Models (PLMs) have shown promising results, they typically require extensive labeled datasets and computationally expensive fine-tuning to achieve strong performance, which makes them impractical in the low-resource settings described above. We therefore adopt a few-shot strategy (0, 4, or 8 shots) to achieve results comparable to those of standard fine-tuning. In this work, we propose MaskedVerbalizer, a novel technique for few-shot text classification. Our method introduces an automatic verbalizer construction approach that generates class-specific label words in 4-shot settings, eliminating the need for extensive manual intervention. Despite its simple model architecture, MaskedVerbalizer achieves strong performance on classification benchmarks. Experimental results demonstrate that our method effectively addresses the core challenges of low-resource text classification, providing a practical, computationally efficient solution. We achieve accuracies of 90.43% and 92.72% with mBERT and XLM-RoBERTa, respectively, representing improvements of 25–30% over soft and automatic verbalizers. The code for MaskedVerbalizer is publicly available at https://github.com/Furqann-hue/MV.
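To make the idea of automatic verbalizer construction concrete, the sketch below shows one possible way to derive class-specific label words from a masked PLM and a handful of labelled examples. It is a hypothetical illustration, not the authors' released implementation (see the repository above): the prompt template, the choice of mBERT, the top-k aggregation, and the English placeholder texts are all assumptions.

```python
# Hypothetical sketch of masked-LM-based automatic verbalizer construction.
# Not the authors' implementation; model, template, and examples are assumed.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: mBERT, as evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# Hypothetical few-shot support set: a few labelled texts per class
# (English placeholders stand in for the RTL-language data).
support_set = {
    "sports":   ["The team won the final match.", "The striker scored twice last night."],
    "politics": ["Parliament passed the new budget bill.", "The minister announced reforms."],
}

# Assumed prompt template with a [MASK] slot for the label word.
TEMPLATE = "{text} This topic is about {mask}."

def candidate_label_words(texts, top_k=10):
    """Aggregate masked-token probabilities over a class's few-shot examples
    and return the top-k vocabulary items as candidate label words."""
    scores = None
    for text in texts:
        prompt = TEMPLATE.format(text=text, mask=tokenizer.mask_token)
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_pos[0]]
        probs = logits.softmax(dim=-1)
        scores = probs if scores is None else scores + probs
    top_ids = scores.topk(top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

# Build a verbalizer: each class is mapped to its highest-scoring label words.
verbalizer = {label: candidate_label_words(texts) for label, texts in support_set.items()}
print(verbalizer)
```

At inference time, such a verbalizer would score an unlabelled text by comparing the masked-position probabilities of each class's label words; the selection and scoring details of MaskedVerbalizer itself are described in the paper and repository.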