Insights from Romanized Manipuri Social Media Text: A Transliteration Corpus and Variation Analysis

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

This paper presents the first large-scale study of Romanized Manipuri, a low-resource Indic language widely used by native speakers on social media. Social media text is highly informal and often noisy, which poses challenges for natural language processing tasks; normalization through back-transliteration is therefore essential. We construct a Romanized Manipuri to Manipuri–Bengali script back-transliteration corpus from YouTube comments, capturing diverse informal writing styles and orthographic variations. We analyze the dataset to examine variation patterns at two levels: character-level inconsistencies and pragmatic stylistic variations influenced by user writing behavior. We also compare social media romanization with formal transliteration conventions, including standardized romanization schemes and textbook-based systems. Furthermore, we evaluate a Transformer model at both the character and subword levels and conduct a detailed error analysis to identify key challenges affecting back-transliteration performance.