A Recipe for Adapting Multilingual Embedders to OCR-Error Robustness and Historical Texts
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Modern multilingual text embedding models excel at semantic search on contemporary text, but their performance degrades measurably on digitized historical documents. This issue is especially pronounced for underrepresented languages such as Luxembourgish, where historical materials combine evolving spelling conventions with OCR artifacts absent from standard training data. To address these challenges, we introduce OCR M-GTE, a pair of multilingual embedding models adapted for OCR robustness and historical texts, and show that the observed degradation can be mitigated through a simple multi-step training procedure tailored to historical spelling variants and OCR noise. We evaluate the models on standard semantic search tasks, under simulated OCR degradation, and on genuine historical collections, observing consistent improvements on OCR-noisy and historical data while maintaining comparable performance on clean modern text. Our ablation findings suggest that multilingual embedding models can be effectively adapted to perform robust cross-lingual search in heterogeneous European digitized corpora. We release our adapted models, code, and datasets under the AGPL-3.0 license: https://github.com/impresso/ocr-robust-multilingual-embeddings