
Open but Unvetted: The Ethics of African Language Data

Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026

DOI:10.63317/2ewucnr8r5hq

Abstract

Creative Commons (CC) licenses are prevalent in African natural language processing (NLP) corpus releases, but their compatibility implications are rarely examined systematically. CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset; a NoDerivs (ND) clause prohibits redistribution of tokenised or annotated derivatives. This paper presents an empirical audit of license provenance across more than twenty corpus families used in African NLP, applying established compatibility rules to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed a Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its HuggingFace dataset card); an ND restriction not reflected in the CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are no longer accessible). A due diligence checklist and a survey of legally compliant enrichment opportunities conclude the paper. We argue that lawful data use is an ethical baseline: for African language communities with limited institutional recourse, license violations are not only legal risks but ethical failures that compound existing power asymmetries.
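
The two compatibility rules named in the abstract can be illustrated with a minimal sketch. This is not the paper's audit procedure: the function names, the corpus names, and the string-based license parsing are assumptions made for illustration, and only the two rules stated above are encoded (CC-BY-SA cannot be combined with CC-BY-NC; an ND clause blocks redistribution of derivatives).

```python
from itertools import combinations

BY_SA = frozenset({"BY", "SA"})
BY_NC = frozenset({"BY", "NC"})


def clauses(license_id: str) -> frozenset:
    """Return the clause codes of an identifier such as 'CC-BY-NC 4.0'.

    The version number is ignored; non-CC identifiers yield an empty set.
    """
    tokens = license_id.split()
    if not tokens:
        return frozenset()
    parts = tokens[0].upper().split("-")
    return frozenset(parts[1:]) if parts[0] == "CC" else frozenset()


def audit(sources: dict, makes_derivatives: bool = True) -> list:
    """List the problems found when merging `sources` into one release.

    `sources` maps a (hypothetical) corpus name to its declared license.
    """
    problems = []
    # Rule 1: ND forbids redistributing tokenised or annotated derivatives.
    for name, lic in sources.items():
        if makes_derivatives and "ND" in clauses(lic):
            problems.append(f"{name}: {lic} does not allow derivatives")
    # Rule 2: CC-BY-SA and CC-BY-NC cannot share one published dataset.
    for (a, lic_a), (b, lic_b) in combinations(sources.items(), 2):
        if {clauses(lic_a), clauses(lic_b)} == {BY_SA, BY_NC}:
            problems.append(f"{a} + {b}: {lic_a} conflicts with {lic_b}")
    return problems


# Hypothetical corpus names, used only to exercise the two rules.
example = {
    "corpus_a": "CC-BY-SA 4.0",
    "corpus_b": "CC-BY-NC 4.0",
    "corpus_c": "CC-BY-ND 4.0",
    "corpus_d": "CC-BY 4.0",
}
for problem in audit(example):
    print(problem)
```

A check of this kind only tests the licenses as declared. It cannot detect the other failure modes the abstract lists, such as a label that misstates the upstream terms or source URLs that no longer resolve.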

Details

Paper ID
lrec2026-ws-rail-13
Pages
pp. 128-139
BibKey
vangassen-2026-open
Editors
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Ernst A.P. van Gassen

Links