A Speech Resource for the Pontic Greek Dialect: Transcription Choices and Baseline ASR Evaluation
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Abstract
Pontic Greek is a living but endangered Modern Greek dialect that lacks publicly available AI-oriented speech resources and ASR benchmarks. This work reports the first systematic inference-only (zero-shot) ASR evaluation on authentic Pontic speech. Progress on Pontic ASR is hindered by two coupled challenges: transcribed speech data are scarce, and the dialect has no standardized orthography, which makes it difficult to create consistent reference transcriptions for evaluation. We address both by releasing a new speech corpus of contemporary Pontic as spoken in Northern Greece, derived from natural conversations and provided with manual, utterance-level, time-aligned transcriptions. To reduce annotator bias and increase practical usability, we collect community evidence on written-form preferences via a small questionnaire and use the observed patterns to guide a consistent Greek-script transcription scheme. Using this corpus, we benchmark four state-of-the-art pretrained speech recognition models in the zero-shot setting under a unified evaluation protocol. The results show that zero-shot recognition of Pontic remains challenging; they establish baseline figures and underscore the need for dialect-specific data and adaptation.
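As an illustration of what a unified evaluation protocol can look like in practice, the sketch below scores every system with the same text normalization and the same error metric. It is a minimal sketch under stated assumptions, not the protocol used in this work: the abstract does not name the metric, so corpus-level word error rate (WER) is assumed here as a common choice, and the normalization rules (lowercasing, punctuation removal) and the function names `normalize`, `edit_distance`, and `corpus_wer` are hypothetical.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words.

    These rules are an assumption for illustration; accented Greek letters
    are kept as they are (only composed to NFC for consistent comparison).
    """
    text = unicodedata.normalize("NFC", text).lower()
    # Keep letters, digits, and whitespace; turn everything else into a space.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return cleaned.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = normalize(reference)
        hyp = normalize(hypothesis)
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_words
```

Applying the same `normalize` step to references and to the output of every model keeps the comparison between systems consistent, which matters especially when the reference orthography is itself a design decision.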