
Benchmarking Text Embedding Models for South African Languages

Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026

DOI:10.63317/28nicfxsyx92

Abstract

In this work we introduce a collection of monolingual embedding models for ten South African languages, using four different architectures. To determine their quality, we evaluate the embeddings on two sequence-labelling tasks: Part-of-Speech (POS) tagging and Named Entity Recognition (NER). Languages are grouped into conjunctive (isiNdebele, isiXhosa, isiZulu, and Siswati), disjunctive (Sepedi, Sesotho, Setswana, Tshivenḓa, and Xitsonga), and Afrikaans to establish the influence of training data set size and typology on the quality of the different embeddings. To isolate representation effects, we train BiLSTM-CRF taggers while keeping the architecture, data splits, and training budget fixed, varying only the input embedding representation: GloVe, fastText, Flair, or RoBERTa. In our experiments, GloVe lags behind fastText, Flair, and the transformer-based models, confirming that static word-level vectors are less suited to morphologically complex, low-resource languages. Subword-aware embeddings such as fastText remain a reliable and computationally efficient baseline, while Flair is the most competitive overall across both POS tagging and NER tasks.
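
The controlled comparison described in the abstract can be pictured as a grid of runs in which only the embedding changes. The following is a minimal sketch of such a grid, not the authors' code: the `TaggerConfig` fields and their values, the `Run` record, and the `build_runs` helper are hypothetical placeholders, and only the language groups, tasks, and embedding names come from the abstract.

```python
from dataclasses import dataclass
from itertools import product

# Language groups, tasks, and embedding types as listed in the abstract.
LANGUAGE_GROUPS = {
    "conjunctive": ["isiNdebele", "isiXhosa", "isiZulu", "Siswati"],
    "disjunctive": ["Sepedi", "Sesotho", "Setswana", "Tshivenḓa", "Xitsonga"],
    "Afrikaans": ["Afrikaans"],
}
TASKS = ["POS", "NER"]
EMBEDDINGS = ["GloVe", "fastText", "Flair", "RoBERTa"]


@dataclass(frozen=True)
class TaggerConfig:
    """Settings shared by every run (hypothetical placeholder values)."""
    hidden_size: int = 256
    max_epochs: int = 50
    seed: int = 0


@dataclass(frozen=True)
class Run:
    """One BiLSTM-CRF training run in the comparison grid."""
    language: str
    group: str
    task: str
    embedding: str
    config: TaggerConfig


def build_runs(config: TaggerConfig = TaggerConfig()) -> list[Run]:
    """Enumerate every language/task/embedding combination with one shared config."""
    runs = []
    for group, languages in LANGUAGE_GROUPS.items():
        for language, task, embedding in product(languages, TASKS, EMBEDDINGS):
            runs.append(Run(language, group, task, embedding, config))
    return runs


runs = build_runs()
# 10 languages x 2 tasks x 4 embeddings
print(len(runs))
```

Because every `Run` carries the same frozen `TaggerConfig`, any difference in scores between two runs for the same language and task can be attributed to the embedding rather than to the tagger settings.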

Details

Paper ID
lrec2026-ws-rail-05
Pages
pp. 41-51
BibKey
devilliers-etal-2026-benchmarking
Editors
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Ockert de Villiers

  • Roald Eiselen
