LREC 2026 main

To Overfit or Not to Overfit? An Evaluation of HTR Workflow on 17th-18th Century French Corpus

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/23hc8mveght2

Abstract

This paper presents an evaluation of general Handwritten Text Recognition (HTR) models applied to a 17th- and 18th-century corpus written in modern French, together with the fine-tuning of these models. Our aim was to transcribe a corpus from this period using existing pre-trained models and to assess their performance on such data. While these general models offer broad linguistic coverage, our results show that they are often insufficiently adapted to the specific handwriting nuances and orthographic inconsistencies of early modern French. To improve the results, we fine-tuned a base model to develop a specialized version trained on our dataset. Although the model still encountered difficulties due to highly variable handwriting styles, it significantly improved transcription accuracy and reduced processing time. Following this step, we used a semi-automatic post-correction tool to address remaining errors and integrated Named Entity Recognition (NER) steps for automated TEI-XML encoding. This paper discusses the evaluation results of both the HTR and NER models, and how overfitting makes it possible to obtain better transcriptions on a specific corpus.

Details

Paper ID
lrec2026-main-078
Pages
pp. 1009-1016
BibKey
tiger-2026-overfit
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Marine Tiger

Links