Open but Unvetted: The Ethics of African Language Data

Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026

Abstract

Creative Commons (CC) licenses are prevalent in African natural language processing (NLP) corpus releases, but their compatibility implications are rarely examined systematically. For example, material under CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset, and a NoDerivs (ND) clause prohibits redistribution of tokenised or annotated derivatives. This paper presents an empirical audit of license provenance across more than twenty corpus families used in African NLP, applying established compatibility rules to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed a Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its HuggingFace dataset card); an ND restriction not reflected in the CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are no longer accessible). The paper concludes with a due-diligence checklist and a survey of legally compliant enrichment opportunities. We argue that lawful data use is an ethical baseline: for African language communities with limited institutional recourse, license violations are not only legal risks but ethical failures that compound existing power asymmetries.
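
The two compatibility rules named in the abstract can be illustrated with a minimal Python sketch. It is not the audit procedure used in the paper: the function names `clauses` and `audit_sources` and the hyphen-separated identifier format (e.g. `CC-BY-SA-4.0`) are assumptions made for illustration, and only the ShareAlike/NonCommercial conflict and the NoDerivs restriction are encoded.

```python
from typing import List, Set


def clauses(license_id: str) -> Set[str]:
    """Split an identifier such as 'CC-BY-NC-4.0' into upper-case tokens."""
    return set(license_id.upper().split("-"))


def audit_sources(licenses: List[str]) -> List[str]:
    """Return the problems found when merging sources under these licenses.

    Hypothetical helper: it checks only the two rules stated in the abstract.
    """
    parsed = [clauses(lic) for lic in licenses]
    problems = []

    # Rule 1: an ND clause forbids redistributing derivatives
    # (for example tokenised or annotated versions of the text).
    if any("ND" in c for c in parsed):
        problems.append("ND: derivatives may not be redistributed")

    # Rule 2: a ShareAlike source without NC cannot be merged with an NC
    # source, because no single license satisfies both sets of terms.
    has_sa_without_nc = any("SA" in c and "NC" not in c for c in parsed)
    has_nc = any("NC" in c for c in parsed)
    if has_sa_without_nc and has_nc:
        problems.append("SA/NC: sources cannot share one published dataset")

    return problems


if __name__ == "__main__":
    print(audit_sources(["CC-BY-SA-4.0", "CC-BY-NC-4.0"]))
    print(audit_sources(["CC-BY-4.0"]))
```

A check of this kind can only flag conflicts among the labels it is given; as the WAXAL and Tanzil cases indicate, the label itself still has to be verified against the primary source.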