Exploring the reusability of Northern Kurdish resources for Badini speech recognition

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

Abstract

Badini is a variant of the Kurdish language spoken in the Duhok province of the Kurdistan Region of Iraq. It is written mainly in a modified version of the Arabic script. Although it shares this script with Central Kurdish (CKB), it is linguistically classified under the Northern Kurdish (KMR) branch. In this paper, we explore the potential and limitations of Northern Kurdish automatic speech recognition (ASR) resources for the Badini variant. First, we transliterate the Common Voice 18 dataset from the Latin script into the modified Arabic script and revise it to align with the orthographic conventions of the Badini variant. Second, we introduce the first text collection for the Badini variant, containing 14.22 million tokens, which serves as a source for speech synthesis. Third, we develop a standard speech recognition benchmark recorded by five speakers, comprising 2 hours and 46 minutes of multi-domain read speech. Results show that combining transliterated and synthetic data significantly improves recognition accuracy, achieving a character error rate (CER) of 6.8% and a word error rate (WER) of 34%. All three resources curated during this research will be made available under the CC BY-NC-ND 4.0 license.
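For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how CER and WER are conventionally computed from an edit distance between a reference transcript and a recognizer hypothesis. It is not the evaluation code used in this work: the function names are hypothetical, no text normalization is applied, and the example strings are English placeholders rather than data from the benchmark.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length.

    Assumption: spaces are counted as characters; some toolkits strip them.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder strings for illustration only (not from the Badini benchmark).
reference = "the cat sat"
hypothesis = "the cat sit"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this toy example a single wrong character makes one of three words incorrect, so WER is much higher than CER. The same effect is one common reason a system can report a low CER alongside a considerably higher WER.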