Human-in-the-Loop Mass Transcription and Ground Truth Annotation for Challenging Historical Documents

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2m45sutdznvy

Abstract

Challenging historical documents still pose significant difficulties for fully automatic layout detection and text recognition, and the results require lengthy, demanding correction. We describe our experiences with complex layouts and present our workflow with AdaptOCR, a web-based annotation tool designed for efficient transcription and ground-truth annotation of demanding historical documents. To address the limitations of existing solutions, AdaptOCR prioritizes a streamlined workflow with an integrated, trainable layout and OCR pipeline. The tool uses the PAGE standard to represent document structure. It supports the annotation of baselines, regions, and text lines and the correction of their transcriptions, and it provides automatic OCR invocation and dictionary-based error detection. It also supports flexible annotations with custom element types and attributes to meet different project requirements. We demonstrate the effectiveness of the workflow and tool in two demanding applications: the transcription of a large corpus of historical printings and the detection and annotation of handwritten artifacts within the private library of the Grimm brothers. In addition, we evaluate the dictionary-based correction and, in a pilot study, assess the efficiency gains from using AdaptOCR.

Details

Paper ID
lrec2026-main-559
Pages
pp. 7023-7033
BibKey
fischer-etal-2026-human
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Norbert Fischer

  • Frank Puppe
