OTA-BOUN: A Historical Turkish Dependency Treebank

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

We present OTA-BOUN v2.0, the largest Universal Dependencies treebank for historical Turkish, consisting of 1,742 manually verified sentences sampled from late Ottoman texts. Annotation was semi-automatic: sentences were pre-annotated with the UDPipe 2.0 pipeline, and annotators then manually corrected the dependency relations, part-of-speech tags, and lemmas. A distinctive feature of OTA-BOUN is its dual-script representation: each sentence is provided in both the original Perso-Arabic script and its Latinized transcription, and each token carries aligned forms in both scripts. This dual-layer design enables research on script conversion, cross-lingual transfer, and historical–modern Turkish comparisons. Supported by detailed analyses of the treebank, OTA-BOUN offers a scalable resource that advances computational studies of historical Turkish and supports broader efforts in multilingual and diachronic NLP.
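To make the dual-script design concrete, the following minimal sketch reads one CoNLL-U sentence and pairs each token's Latin form with its Perso-Arabic form. The layout is an assumption made for illustration, not the treebank's documented schema: the `# text_orig` comment and the `OrigForm` key in the MISC column are hypothetical names, and the two-token sentence is a toy example that is not drawn from OTA-BOUN.

```python
# Toy CoNLL-U sentence. Assumed (hypothetical) layout: FORM holds the Latin
# transcription, and the MISC column carries the Perso-Arabic form under the
# made-up key "OrigForm". The sentence is illustrative, not from OTA-BOUN.
SAMPLE = (
    "# sent_id = toy-1\n"
    "# text = kitab geldi\n"
    "# text_orig = كتاب گلدی\n"
    "1\tkitab\tkitab\tNOUN\t_\t_\t2\tnsubj\t_\tOrigForm=كتاب\n"
    "2\tgeldi\tgel\tVERB\t_\t_\t0\troot\t_\tOrigForm=گلدی\n"
)


def parse_sentence(block):
    """Parse one CoNLL-U sentence into (metadata dict, list of token dicts)."""
    meta, tokens = {}, []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            # Comment lines have the shape "# key = value".
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 CoNLL-U columns, got {len(cols)}")
        # MISC is a "|"-separated list of Key=Value pairs, or "_" when empty.
        misc = dict(
            item.split("=", 1) for item in cols[9].split("|") if "=" in item
        )
        tokens.append(
            {
                "id": cols[0],
                "latin": cols[1],
                "lemma": cols[2],
                "upos": cols[3],
                "head": int(cols[6]),
                "deprel": cols[7],
                "orig": misc.get("OrigForm"),  # hypothetical key
            }
        )
    return meta, tokens


meta, tokens = parse_sentence(SAMPLE)
# Token-level alignment between the two scripts.
aligned = [(tok["latin"], tok["orig"]) for tok in tokens]
print(meta["text"], "|", meta["text_orig"])
print(aligned)
```

Aligned pairs of this kind are the sort of input a script-conversion model could train on, while the sentence-level fields keep both renderings available for historical–modern comparison.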