First Steps in ASR for Cypriot Greek: Challenges and Insights
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Abstract
This paper presents the first automatic speech recognition (ASR) system for Cypriot Greek, a non-standardized variety of Modern Greek with distinctive phonological, lexical, and orthographic characteristics. We adapt Whisper, a state-of-the-art multilingual ASR model, to Cypriot Greek by fine-tuning it on the Mozilla Common Voice Spontaneous Speech dataset for Cypriot Greek. The phonological and lexical divergence of Cypriot Greek from Standard Modern Greek poses significant challenges for mainstream ASR, particularly under conditions of limited training data and dialectal variation. In our experiments, whisper-medium achieved a best word error rate (WER) of 37.85%, while whisper-large-v3 consistently outperformed it, reaching a minimum WER of 33.93%. These findings indicate that increased model size, combined with targeted fine-tuning on normalized dialectal data, significantly improves recognition accuracy, and that careful handling of orthographic and dialectal variation provides an effective path for adapting ASR to low-resource varieties.
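For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance between a reference transcript and the system output (substitutions + deletions + insertions), divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function name `word_error_rate` and the example strings are hypothetical, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; it splits on whitespace only and
    performs no orthographic normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
wer = word_error_rate("a b c d", "a x c")
print(f"WER = {wer:.2%}")
```

Because the score is computed over exact word matches, spelling variants of the same word count as errors, which is one reason orthographic normalization matters for a non-standardized variety.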