
Extension of Linguistic Resources for South African Languages: Part-of-Speech Annotated Domain-Specific Data

Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026

DOI:10.63317/4agbqhnpu5hu

Abstract

In this paper, we present part-of-speech (POS) annotated domain-specific data for nine South African languages. The data has been sourced from five domains (two academic domains, Caps and theses; two non-academic domains, news and magazines; and one fiction domain, novels), uniformly pre-processed, automatically POS-tagged and then corrected by linguistic experts. The widely used NCHLT government data sets (Eiselen and Puttkammer, 2014) have also been re-tagged with the current tag sets and manually corrected. Both the new domain-specific data sets and the re-tagged NCHLT data sets have been uploaded to a public repository. To illustrate the characteristics of the domain data in comparison to government data, we include and discuss data statistics, namely type-token ratio (TTR), tokens per sentence and out-of-vocabulary (OOV) rates, as well as POS tagging results with a baseline tagger trained on NCHLT data and applied to the different domains for all languages. Both the data statistics and the POS results clearly show that the domain data differs significantly from government data: for all domains and languages, tagging accuracy decreases significantly compared to testing on in-domain government data. Also, POS results for the two domains with the highest OOV rates for all languages (Caps and novels) are much lower than for the other domains. These findings emphasise the need for more diverse data resources, which in turn will aid in the development of more domain-independent language technologies.
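For readers unfamiliar with the three data statistics named in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' implementation: the function names and toy sentences are hypothetical, and the sketch assumes whitespace-tokenised, case-sensitive input and a token-level OOV rate measured against the training vocabulary. The paper itself may define these measures differently.

```python
from typing import List

Sentence = List[str]


def type_token_ratio(sentences: List[Sentence]) -> float:
    """Number of distinct tokens (types) divided by total tokens."""
    tokens = [tok for sent in sentences for tok in sent]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tokens_per_sentence(sentences: List[Sentence]) -> float:
    """Average sentence length in tokens."""
    if not sentences:
        return 0.0
    return sum(len(sent) for sent in sentences) / len(sentences)


def oov_rate(train: List[Sentence], test: List[Sentence]) -> float:
    """Share of test tokens that never occur in the training data.

    Assumption: token-level rate; a type-level rate is also common.
    """
    vocab = {tok for sent in train for tok in sent}
    test_tokens = [tok for sent in test for tok in sent]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Hypothetical toy data standing in for government (train) and domain (test) text.
government = [["the", "cat", "sat"], ["the", "dog", "ran"]]
domain = [["the", "bird", "sat"], ["a", "cat", "flew", "away"]]

ttr_government = type_token_ratio(government)
avg_len_domain = tokens_per_sentence(domain)
oov_domain = oov_rate(government, domain)
```

A higher OOV rate means the tagger sees more words at test time that it never encountered in training, which is consistent with the lower tagging accuracy the abstract reports for the Caps and novels domains.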

Details

Paper ID
lrec2026-ws-rail-02
Pages
pp. 7-19
BibKey
gaustad-etal-2026-extension
Editors
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Tanja Gaustad

  • Roald Eiselen

  • Cindy Arlene McKellar

Links