TransVar – the Corpus for Variation and Change Study of the Historical Transcarpathian lects

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

Abstract

The paper introduces TransVar, a corpus of the historical Transcarpathian lects recorded in the first half of the 20th century in territories that now belong to Ukraine, Poland, Slovakia, and Romania. The corpus contains data from the small territorial lect groups Lemko, Bojko, and Hutsul. It is crucial for studying the people of these territories, who were forcibly deported from their homeland in the 1940s–1950s, soon after the recordings were made (1920s–1930s). The article also provides a brief overview of the linguistic properties of these lects, as evidenced in the material. The corpus is morphosyntactically tagged: it contains part-of-speech tags, morphological features, lemmata, and syntactic dependencies. The study stresses the importance of manually analysing the errors made in the automatic tagging phase so that the annotation can be further improved. Supplementary information includes the named entities encountered in the texts and the basic vocabulary. All texts are accompanied by the metalinguistic information required for sociolinguistic study. After analysing the current stage of corpus creation, the article outlines further research prospects. Apart from more thorough manual annotation, one prospect is to add English translations to make the material more accessible to scholars without a background in Slavic studies.