Sentiment Analysis and Language Models for Kwanyama
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Kwanyama is related to Swahili, Zulu, and the more than 300 other languages in the Bantu family. Yet, unlike its better-known relatives, it remains almost entirely absent from modern Natural Language Processing (NLP). We bring Kwanyama into the LLM era of NLP through two key contributions. First, we introduce OkaSentiment, the first sentiment-labeled dataset for Kwanyama. Unlike prior African sentiment corpora, which rely primarily on social media, OkaSentiment is grounded in an offline, culturally relevant domain: reviews of domestic labor relationships. The dataset is annotated by over 40 native speakers under expert supervision, with careful quality control. Second, we present OkaLM, the first language models for Kwanyama (1B, 3B, and 8B parameters), obtained by continued pretraining of LLaMA-3 checkpoints on a curated Kwanyama corpus. Together, OkaSentiment and OkaLM bring a left-behind language into the landscape of modern NLP, providing its first benchmark and language models.