Benchmarking Text Embedding Models for South African Languages

Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026

Abstract

We introduce a collection of monolingual embedding models for ten South African languages, spanning four embedding architectures. To assess their quality, we evaluate the embeddings on two sequence-labelling tasks: Part-of-Speech (POS) tagging and Named Entity Recognition (NER). We group the languages into conjunctive (isiNdebele, isiXhosa, isiZulu, and Siswati), disjunctive (Sepedi, Sesotho, Setswana, Tshivenḓa, and Xitsonga), and Afrikaans in order to establish how training data set size and typology influence the quality of the different embeddings. To isolate representation effects, we train BiLSTM-CRF taggers with the architecture, data splits, and training budget held fixed, varying only the input embedding representation: GloVe, fastText, Flair, or RoBERTa. In our experiments, GloVe lags behind fastText, Flair, and the transformer-based models, confirming that static word-level vectors are less suited to morphologically complex, low-resource languages. Subword-aware embeddings such as fastText remain a reliable and computationally efficient baseline, while Flair is the most competitive overall across both POS tagging and NER.
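The controlled comparison described above can be illustrated as a grid of runs in which only the embedding changes within each language and task. The following is a minimal sketch, not the authors' code: the language groups, tasks, and embedding names come from the abstract, while the `Run` class, the `build_runs` helper, and every value in `FIXED_SETTINGS` (hidden size, epoch budget, seed) are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product

# Language groups, tasks, and embedding types as listed in the abstract.
LANGUAGE_GROUPS = {
    "conjunctive": ["isiNdebele", "isiXhosa", "isiZulu", "Siswati"],
    "disjunctive": ["Sepedi", "Sesotho", "Setswana", "Tshivenḓa", "Xitsonga"],
    "Afrikaans": ["Afrikaans"],
}
TASKS = ["POS", "NER"]
EMBEDDINGS = ["GloVe", "fastText", "Flair", "RoBERTa"]

# ASSUMPTION: placeholder hyperparameters. The abstract only states that the
# tagger architecture, data splits, and training budget are held fixed; it
# does not give these values.
FIXED_SETTINGS = {
    "tagger": "BiLSTM-CRF",
    "hidden_size": 256,
    "max_epochs": 50,
    "seed": 13,
}


@dataclass(frozen=True)
class Run:
    """One tagger training run (hypothetical structure)."""
    language: str
    group: str
    task: str
    embedding: str
    tagger: str
    hidden_size: int
    max_epochs: int
    seed: int


def build_runs():
    """Enumerate every language/task/embedding combination with identical settings."""
    runs = []
    for group, languages in LANGUAGE_GROUPS.items():
        for language, task, embedding in product(languages, TASKS, EMBEDDINGS):
            runs.append(
                Run(language=language, group=group, task=task,
                    embedding=embedding, **FIXED_SETTINGS)
            )
    return runs


if __name__ == "__main__":
    runs = build_runs()
    # 10 languages x 2 tasks x 4 embeddings
    print(len(runs))
```

Because every run shares the same settings and differs only in its embedding, any difference in tagging scores within a language and task can be attributed to the input representation rather than to the tagger.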