LREC 2026 workshop

A Dialectal Corpus for Ukrainian: Collection, Classification, and Standardization

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

DOI:10.63317/4mkaru7y2op5

Abstract

Ukrainian dialects remain largely excluded from the digital linguistic landscape despite their active everyday use. We present a regional dialect corpus covering 18 administrative regions of Ukraine, compiled from digitized fieldwork collections and an online dialect atlas. The corpus comprises over 284,000 tokens of dialect text, annotated by region and partially accompanied by manually standardized translations. Using these resources, we investigate language identification and dialect-to-standard standardization. Baseline language identification yields an F-score of 0.75, rising to 0.99 with dialect-inclusive training. Dialect classification reaches 0.58, with confusion patterns reflecting known regional boundaries. For standardization, the best-performing LLM achieves a COMET score of 0.80, though BLEU scores remain low (0.21–0.23) across all models. We release the corpus, labelled datasets, model outputs, and reference translations to support future work on inclusive language technologies for non-standard varieties.

Details

Paper ID
lrec2026-ws-dialres-14
Pages
pp. 135-143
BibKey
frund-etal-2026-dialectal
Editors
Antonis Anastasopoulos, Stella Markantonatou, Angela Ralli, Marcos Zampieri, Stavros Bompolas, Vivian Stamou
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Yuliia Frund

  • Sina Ahmadi

Links