EuReCo, KorAP and DeReKo: Updates on Ingestion and Annotation Pipelines, Backend, Interfaces, Operation, and Corpora

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

Abstract

This paper reports on recent technical developments in the European Reference Corpus EuReCo and its current technical implementation, which is based on the corpus search and analysis platform KorAP. We describe updates to the ingestion pipeline, including extensions to the TEI-to-KorAP-XML converter tei2korapxml and the KorAP tokenizer, as well as the newly introduced korapxmltool for annotation and index conversion. We then present Koral-Mapper, a service that enables cross-schema comparability of annotations and metadata at query time. We also report on developments in the backend access control system Kustvakt, the web user interface Kalamar, API client libraries for R and Python that promote reproducibility and methodologically sound AI-assisted analysis, and containerized deployment. We outline the corpora and languages currently represented in EuReCo and discuss in detail the role of the German Reference Corpus DeReKo, including its metadata-driven virtual corpus design, useful predefined subcorpora, and TEI encoding. Finally, we present the National Libraries as Corpus approach and DeLiKo-2025@DNB as its first full-scale proof of concept, and discuss the potential of this approach for extending EuReCo with comparable contemporary fiction corpora across European countries.