To Overfit or Not to Overfit? An Evaluation of HTR Workflows on a 17th-18th Century French Corpus
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This paper presents the results of an evaluation of general Handwritten Text Recognition (HTR) models applied to a 17th- and 18th-century corpus written in modern French, as well as the fine-tuning of these models. Our aim was to transcribe a corpus from this period using existing pre-trained models and to assess their performance on such data. While these general models offer broad linguistic coverage, our results demonstrate that they are often insufficiently adapted to the specific handwriting nuances and orthographic inconsistencies of early modern French. To improve the results, we fine-tuned a base model to develop a specialized version trained on our dataset. Although this model still encountered difficulties due to highly variable handwriting styles, it significantly improved transcription accuracy and reduced processing time. Following this step, we used a semi-automatic post-correction tool to address the remaining errors and integrated a Named Entity Recognition (NER) step for automated TEI-XML encoding. This paper discusses the evaluation results of both the HTR and NER models, and how overfitting can yield better transcriptions of a specific corpus.