LREC 2026, Main Conference

A Recipe for Adapting Multilingual Embedders to OCR-Error Robustness and Historical Texts

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/29rfx5wcyz3z

Abstract

Modern multilingual text embedding models excel at semantic search on contemporary text, but their performance degrades measurably on digitized historical documents. This issue is especially pronounced for underrepresented languages such as Luxembourgish, where historical materials combine evolving spelling conventions with OCR artifacts absent from standard training data. To address these challenges, we introduce OCR M-GTE, a pair of multilingual embedding models adapted for OCR robustness and historical texts, and show that the observed degradation can be mitigated through a simple multi-step training procedure tailored to historical variants and OCR noise. We evaluate the models on standard semantic search tasks, simulated OCR degradation, and genuine historical collections, observing consistent improvements under OCR-induced noise and on genuine historical data while maintaining comparable performance on clean modern text. Our ablation findings suggest that multilingual embedding models can be effectively adapted to perform robust cross-lingual search in heterogeneous European digitized corpora. We release our adapted models, code, and datasets under the AGPL-3.0 license: https://github.com/impresso/ocr-robust-multilingual-embeddings

Details

Paper ID
lrec2026-main-071
Pages
pp. 929-935
BibKey
michail-etal-2026-recipe
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 - 16 May 2026

Authors

  • Andrianos Michail

  • Stylianos Psychias

  • Juri Opitz

  • Simon Clematide

Links