LREC-COLING 2024 workshop

Developing Bilingual English-Setswana Datasets for Space Domain

Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

DOI:10.63317/3cr7bjd8jyfd

Abstract

In the current digital age, languages lacking digital presence face an imminent risk of extinction. In addition, the absence of digital resources poses a significant obstacle to the development of Natural Language Processing (NLP) applications for such languages. Therefore, the development of digital language resources contributes to the preservation of these languages and enables application development. This paper contributes to the ongoing efforts of developing language resources for South African languages with a specific focus on Setswana and presents a new English-Setswana bilingual dataset that focuses on the space domain. The dataset was constructed using the expansion method. A subset of space domain English synsets from Princeton WordNet was professionally translated to Setswana. The initial submission of translations demonstrated an accuracy rate of 99% before validation. After validation, continuous revisions and discussions between translators and validators resulted in a unanimous agreement, ultimately achieving a 100% accuracy rate. The final version of the resource was converted into an XML format due to its machine-readable framework, providing a structured hierarchy for the organization of linguistic data.

Details

Paper ID
lrec2024-ws-rail-04
Pages
pp. 32-36
BibKey
moape-etal-2024-developing
Editors
Rooweither Mabuya, Muzi Matfunjwa, Mmasibidi Setaka, Menno van Zaanen
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Location
Turin, Italy
Date
20 - 25 May 2024

Authors

  • Tebatso G. Moape

  • Sunday Olusegun Ojo

  • Oludayo O. Olugbara
