Pontic Greek in the Caucasus: an online corpus
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Abstract
This paper presents a multimedia corpus of Pontic Greek as spoken by Pontic Greek speakers in the Caucasus (Georgia). The corpus covers three major stages reflecting different sociolinguistic settings: (a) Pontic Greek in small rural communities in Georgia (original settlements); (b) internal migration to urban centers (within Georgia); (c) external migration (to Greece). The dataset comprises 373 audio recordings (total duration: 7h 26m; total word count: 43,073). The open-access resource includes audio files (WAV) and annotations (XML). Annotations provide orthographic transcription, morphemic transcription, and morpheme-by-morpheme and sentence-by-sentence translations into English (Toolbox); transcriptions are time-aligned with the audio files (ELAN). This collection is intended for linguists working on dialectology and language contact, as well as for people with a broader interest in the history and practices of this community. Pontic Greek in the Caucasus offers a unique opportunity to investigate contact between Greek and another Indo-European language (Russian) as well as two non-Indo-European languages (Georgian, Turkish).