A Morpho-Syntactically Annotated Corpus of Ògè Folk Narratives with a Focus on Nominal Structure
Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026
Abstract
This paper presents a manually annotated morpho-syntactic corpus of Ògè, an under-resourced indigenous language spoken in Nigeria. The corpus consists of ten folk narratives (approximately 4,667 tokens) collected for the investigation of nominal structure. Annotation is expert-driven and includes token-level part-of-speech tagging together with a structured Determiner Phrase (DP) classification framework designed to capture language-specific nominal configurations. The scheme distinguishes between bare nouns and modified noun phrases, reflecting a central structural property of Ògè: noun forms remain morphologically stable across contexts, while modifiers exhibit formal and positional variation contributing to reference, specificity, and discourse prominence. The DP classification layer encodes both simple and complex nominal constructions, enabling systematic analysis of internal phrase structure. Designed as a reusable digital resource, the corpus supports morphosyntactic tagging, noun phrase boundary detection, and modeling of nominal structure in low-resource NLP settings. The annotated dataset will be made publicly available through the SADiLaR repository. This work demonstrates how descriptive linguistic analysis can inform annotation design and provides a replicable framework for developing structured resources for under-resourced African languages.
Keywords: Ògè, low-resource NLP, annotated corpus, nominal structure, African languages