Comparing Source Language Selection Strategies for Multi-Source Cross-Lingual Transfer to African Languages
Proceedings of Resources for African Indigenous Languages (RAIL) 2026 @ LREC 2026
Abstract
Cross-lingual transfer learning makes it possible to build NLP systems for low-resource languages by leveraging data from higher-resource languages. A critical but understudied question for African languages is which source languages to select for multi-source transfer. We present a systematic comparison of four source language selection strategies: random selection (baseline), genetic distance based on language family trees, geographic distance based on speaker locations, and embedding similarity from multilingual models. We evaluate these strategies on Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and sentiment analysis across five typologically diverse African target languages (Hausa, Yoruba, Swahili, Igbo, Kinyarwanda) using three multilingual models. We further investigate how the number of source languages affects transfer performance. Our experiments reveal that no single strategy dominates across tasks: geographic distance leads on the sequence labeling tasks (NER and POS tagging), embedding similarity is most effective for sentiment analysis, and all informed strategies consistently outperform random selection.
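To make the selection strategies concrete, the following is a minimal illustrative sketch, not the implementation used in our experiments. It assumes each strategy reduces to ranking candidate source languages by a distance to the target and keeping the k nearest, with random selection as the fallback baseline. The function names (`select_sources`, `haversine_km`, `cosine_distance`), the placeholder language labels, and all coordinates and vectors are hypothetical toy values; a tree-based genetic distance could be supplied as another `distance` callable in the same way.

```python
import math
import random


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def select_sources(target, candidates, k, distance=None, seed=0):
    """Return k candidate names: the k nearest under `distance`,
    or a seeded random sample when `distance` is None (baseline)."""
    names = sorted(candidates)
    if distance is None:
        return random.Random(seed).sample(names, k)
    return sorted(names, key=lambda n: distance(target, candidates[n]))[:k]


# Hypothetical toy inputs: placeholder labels, not real language data.
coords = {"src_a": (0.0, 10.0), "src_b": (5.0, 30.0), "src_c": (-20.0, 25.0)}
embeds = {"src_a": (0.0, 1.0), "src_b": (1.0, 0.1), "src_c": (0.7, 0.7)}

geo_pick = select_sources((1.0, 12.0), coords, k=2, distance=haversine_km)
emb_pick = select_sources((1.0, 0.0), embeds, k=2, distance=cosine_distance)
rand_pick = select_sources(None, coords, k=2)
```

On these toy inputs the geographic and embedding rankings select different source sets, which is the kind of disagreement between strategies that the comparison is designed to measure. Varying `k` corresponds to changing the number of source languages.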