Paper Information

lrec2026-ws-rail-02

Extension of Linguistic Resources for South African Languages: Part-of-Speech Annotated Domain-Specific Data

Abstract

In this paper, we present part-of-speech (POS) annotated domain-specific data for nine South African languages. The data has been sourced from five different domains (two academic domains, Caps and theses; two non-academic domains, news and magazines; and one fiction domain, novels), uniformly pre-processed, automatically POS-tagged and then corrected by linguistic experts. The widely used NCHLT government data sets (Eiselen and Puttkammer, 2014) have also been re-tagged with the current tag sets and manually corrected. Both the new domain-specific data sets and the re-tagged NCHLT data sets have been uploaded into a public repository. To illustrate the characteristics of the domain data in comparison to government data, we include and discuss data statistics, namely type-token ratio (TTR), tokens per sentence and out-of-vocabulary (OOV) rates, as well as POS tagging results with a baseline tagger trained on NCHLT data and applied to the different domains for all languages. Both the data statistics and the POS results clearly show that the domain data is significantly different to government data: for all domains and languages, the tagging accuracy decreases significantly compared to testing on in-domain government data. Also, POS results for the two domains with the highest OOV rates for all languages (Caps and novels) are much lower than for the other domains. These findings emphasise the need for more diverse data resources, which in turn will aid in the development of more domain-independent language technologies.
