RuznamceNER: A Named Entity Recognition Dataset for Ottoman Turkish
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Named Entity Recognition (NER) in historical texts poses distinct challenges. Language change, reflected in spelling variations, archaic vocabulary, and inconsistent orthography, diminishes the efficacy of models trained on contemporary corpora. The limited availability of annotated historical datasets constrains the development and evaluation of accurate, domain-specific NER systems, underscoring the need for specialized approaches and domain adaptation. In this work, we introduce the ruznamçe registers as a valuable digital historical resource with broad potential for diverse NLP applications. Our primary contribution is RuznamceNER, a manually annotated NER dataset derived from ruznamçe documents spanning two centuries. The dataset contains 2,138 sentences and a total of 8,730 annotated entities of the types PERSON, LOCATION, and ORGANIZATION. We further report evaluation results using a BERT-CRF baseline model pre-trained on modern Turkish, highlighting the pivotal importance of in-domain training data for effective NER in historical contexts. Experimental results on the RuznamceNER test set under various training configurations show that even a small amount of supervised in-domain data can yield robust performance for well-structured texts, despite significant lexical and orthographic differences between historical and modern language forms.