From Romanized to Devanagari: Enhancing Nepali Sentiment Analysis with NepaliXlit
Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)
Abstract
Romanized Nepali is the dominant medium of social media communication in Nepal, yet most multilingual NLP models are trained on Devanagari text, which leads to a noticeable drop in performance in informal settings. To address this script mismatch, we develop NepaliXlit, a transliteration model fine-tuned from IndicXlit to better handle the phonetic variability of Romanized Nepali. Trained on 2,943 informal word pairs and evaluated on 736 held-out pairs, NepaliXlit improves transliteration accuracy by 8% and reduces character error rate by 11%. We then use sentiment analysis as a testbed to assess whether transliteration actually helps downstream NLP tasks. We curate over 6,500 Romanized social media comments and construct a balanced subset of 1,518 manually annotated instances. Baseline experiments show that multilingual encoder models struggle with Romanized input; however, transliterating the text into Devanagari with NepaliXlit consistently improves sentiment classification accuracy for mBERT and MuRIL. A comparative evaluation against large language models (LLMs) further reveals that generative models such as Gemini and GPT variants exhibit strong cross-script generalization and outperform the encoder-based baselines. Our results indicate that adaptive transliteration enhances conventional multilingual models, while modern LLMs offer a stronger alternative for multi-script, low-resource settings.
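For readers unfamiliar with the two transliteration metrics mentioned above, the following is a minimal sketch of how word-level accuracy and character error rate (CER) are commonly computed. It assumes exact-match accuracy and CER defined as total Levenshtein edits divided by total reference length, counted over Unicode code points; the abstract does not state the exact definitions used, and the function names `edit_distance` and `evaluate` as well as the toy word pairs are hypothetical, not taken from the NepaliXlit evaluation.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted over Unicode code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (word accuracy, CER) for a list of (reference, hypothesis) pairs.

    Assumed definitions (not confirmed by the source):
      accuracy = exact matches / number of pairs
      CER      = total edit distance / total reference length
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    exact = sum(1 for ref, hyp in pairs if ref == hyp)
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_chars = sum(len(ref) for ref, _ in pairs)
    return exact / len(pairs), edits / ref_chars


if __name__ == "__main__":
    # Hypothetical (reference, model output) pairs, for illustration only.
    toy_pairs = [("राम्रो", "राम्रो"), ("नेपाल", "नेपल")]
    accuracy, cer = evaluate(toy_pairs)
    print(f"accuracy={accuracy:.3f}  CER={cer:.3f}")
```

Counting code points treats a Devanagari vowel sign or virama as its own character; an evaluation that counts grapheme clusters instead would yield different CER values, so the unit should be fixed before comparing systems.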