Bootstrapping NLP for Sakha: Named Entity Recognition and Sentiment Analysis in an Extremely Low-Resource Setting
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present the first systematic study of core NLP tasks for Sakha (Yakut), a low-resource Turkic language with approximately 450,000 speakers in northeastern Siberia. We introduce two manually annotated datasets: a 690-sentence NER corpus (921 entities: PER, LOC, ORG) and a 798-sentence sentiment corpus (positive, negative, neutral). In controlled 2×2 experiments with mBERT and RuBERT, we observe a dual effect: performance improves when the base model's unknown-token rate exceeds approximately 10% (RuBERT: +9.4 F1) but degrades otherwise (mBERT: −6.1 F1), even though tokenization improves in both cases. Cross-domain transfer (news vs. forums) reveals a severe asymmetry: training on formal text and evaluating on informal text achieves 47% accuracy, while the reverse yields only 26%, a 21-point gap showing that domain composition outweighs model architecture choice in low-resource settings. Neutral-boundary detection is the primary bottleneck: 89% of annotator disagreements cluster around subjective/objective distinctions rather than polarity confusions. With fewer than 1,000 samples per task, we establish the first benchmarks for Sakha NER (53.5 F1) and sentiment analysis (54% accuracy).