A Dialectal Corpus for Ukrainian: Collection, Classification, and Standardization

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

Abstract

Ukrainian dialects remain largely excluded from the digital linguistic landscape despite their active everyday use. We present a regional dialect corpus covering 18 administrative regions of Ukraine, compiled from digitized fieldwork collections and an online dialect atlas. The corpus comprises over 284,000 tokens of dialect text, annotated by region; part of it is accompanied by manually standardized translations. Using these resources, we investigate language identification, dialect classification, and the standardization of dialect text into standard Ukrainian. Baseline language identification yields an F-score of 0.75, rising to 0.99 when dialect data is included in training. Dialect classification reaches 0.58, with confusion patterns reflecting known regional boundaries. For standardization, the best-performing LLM achieves a COMET score of 0.80, though BLEU scores remain low (0.21–0.23) across all models. We release the corpus, labelled datasets, model outputs, and reference translations to support future work on inclusive language technologies for non-standard varieties.