A CLDF-Compliant Lexical Database for Modern Greek Dialects: Resource Design and Dialectometric Analysis
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Abstract
This paper presents the first CLDF database that systematically documents lexical variation across 36 Modern Greek varieties (including Standard Modern Greek). The dataset aligns 14,378 lexical items over 345 concepts, links varieties to stable identifiers (Glottocodes), introduces a Greek-specific concept list, and maps meanings to standardized Concepticon concept sets, enabling interoperability and reproducible workflows. To assess whether the database preserves a meaningful dialectological signal, we conduct a dialectometric analysis by computing feature-sensitive string distances over IPA transcriptions and applying hierarchical clustering. The resulting similarity structure recovers major macro-divisions—most notably a broad Northern vs. Southern partition among Koine-descended mainland varieties—and isolates peripheral groups with distinct historical trajectories (e.g., Asia Minor, Italiot, Tsakonian). The database provides scalable infrastructure for quantitative dialectology, comparative Greek linguistics, and dialect-aware language technology.
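As a rough illustration of the dialectometric step described above, the following minimal sketch computes pairwise distances between varieties and clusters them agglomeratively. It rests on several assumptions that are not taken from the paper: plain normalized Levenshtein distance stands in for the feature-sensitive string distance, average linkage stands in for the unspecified clustering method, and the variety labels (`V1` to `V4`), concept labels (`c1`, `c2`), and forms are invented placeholders rather than entries from the database.

```python
from itertools import combinations


def levenshtein(a, b):
    """Plain edit distance between two segment sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer form."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def variety_distance(forms_a, forms_b):
    """Mean normalized distance over the concepts attested in both varieties."""
    shared = sorted(set(forms_a) & set(forms_b))
    if not shared:
        raise ValueError("no shared concepts")
    total = sum(normalized_distance(forms_a[c], forms_b[c]) for c in shared)
    return total / len(shared)


def average_linkage(names, dist):
    """Agglomerative clustering with average linkage; returns the merge history."""
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            c1, c2 = pair
            pairwise = [dist[frozenset((x, y))] for x in c1 for y in c2]
            return sum(pairwise) / len(pairwise)

        c1, c2 = min(combinations(clusters, 2), key=linkage)
        merges.append((c1, c2, linkage((c1, c2))))
        remaining = [c for c in clusters if c not in (c1, c2)]
        clusters = remaining + [tuple(sorted(c1 + c2))]
    return merges


# Hypothetical toy data: invented forms, NOT taken from the database.
toy = {
    "V1": {"c1": "pata", "c2": "kalo"},
    "V2": {"c1": "pata", "c2": "kalu"},
    "V3": {"c1": "tsipo", "c2": "meri"},
    "V4": {"c1": "tsipu", "c2": "meri"},
}

dist = {
    frozenset((a, b)): variety_distance(toy[a], toy[b])
    for a, b in combinations(sorted(toy), 2)
}
merges = average_linkage(sorted(toy), dist)

for left, right, height in merges:
    print(left, right, round(height, 3))
```

In a realistic setting, the IPA transcriptions would first be tokenized into segments (so that affricates, diacritics, and length marks count as single units), and the substitution cost would depend on phonological features rather than being a flat 0 or 1.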