Word segmentation for UD: a comparison of isiZulu and Sepedi
Proceedings of the Ninth Workshop on Universal Dependencies (UDW 2026)
Abstract
The Southern Bantu language family contains languages with so-called conjunctive orthographies and languages with disjunctive orthographies. In languages with conjunctive orthographies, such as isiZulu, orthographic words correspond to linguistic words, whereas in languages with disjunctive orthographies, such as Sepedi, the prefix morphemes of verbs and other predicates are written as separate orthographic words. When developing Universal Dependencies treebanks, the basic principle is to take syntactic (linguistic) words as the unit of annotation, but for languages with agglutinating morphology, it has been argued that doing so reduces the informativeness of the treebank. In this paper we investigate this claim by analysing and measuring the effects of annotating Universal Dependencies on the basis of orthographic words in two morphosyntactically parallel treebanks for isiZulu and Sepedi.