Reclaiming African Voices: Surveying Indigenous Writing Systems for Inclusive NLP
Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026
Abstract
Multilingual NLP has expanded rapidly through large-scale pretraining and cross-lingual transfer, yet this progress remains structurally uneven across writing systems. This survey reframes multilingual NLP around scripts rather than languages, arguing that writing systems constitute an under-theorized axis of computational inequality. Focusing on Indigenous (Vai, Ge’ez, Tifinagh), modern (ADLaM, N’Ko), and adapted Arabic-based (Ajami) African scripts, we analyze how script properties interact with digital infrastructure, tokenization, and downstream task performance. We organize the literature across four analytical layers: infrastructural (Unicode and input systems), representational (segmentation efficiency and vocabulary allocation), functional (task-level disparities), and epistemic (evaluation bias and the "low-resource" framing). Synthesizing evidence from 47 studies, we show that performance gaps across scripts arise primarily from engineering design choices rather than intrinsic linguistic complexity. We conclude by outlining a research agenda for native multiscript foundation models, including script-aware scaling laws, tokenizer equity metrics, and evaluation reform. We argue that multiscript equity is not a peripheral concern but a structural precondition for genuine multilingual inclusion.