Automatic Generation of Graded Texts in Old Church Slavonic
Proceedings of the 2nd Workshop on Evaluating Text Difficulty in a Multilingual Context (DeTermIt! 2026)
Abstract
Over the past few decades, graded readers have become a valued tool in language education, and their use has even extended to the so-called classical (or ‘dead’) languages, such as Latin and Greek. Immersive reading of and listening to adapted texts in these languages has been shown to increase students’ proficiency, independence and motivation. However, only a small number of such resources currently exist, and they cover only a few classical languages. The present study investigates the current potential for (semi-)automatic generation of adapted classical-language readers, focusing on Old Church Slavonic. From a Natural Language Processing (NLP) point of view, working with the language is challenging because of the variety of dialects and the diachronic variation it encompasses. The study proceeds in three steps: 1) Representative measurable characteristics of professional classical-language readers, such as the Latin Lingua latina per se illustrata and the Greek Athenaze, are analysed. 2) Automatic generation of adapted Old Church Slavonic text is attempted using a sequence-to-sequence model (mT5) as well as a Large Language Model (GPT-5) in a one-shot setting. 3) The quality of the generated texts is assessed through both human evaluation and a comparison of their textual characteristics with those of the professional texts analysed in step 1). The edited versions of the GPT-based texts are shared for future reference and use.