
CoMMA, a Large-scale Corpus of Multilingual Medieval Archives

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/5pjzh8ma5v76

Abstract

We present CoMMA, a large-scale corpus of medieval manuscripts produced through automatic text recognition. The corpus contains around 2.5 billion tokens drawn from more than 23,000 digitized manuscripts in Latin and Old French, harvested via IIIF. Unlike other resources, it is made of raw, non-normalized text enriched with layout analysis in various formats. We describe the pipeline used for large-scale acquisition and processing, and report quantitative and qualitative evaluations (average character error rate, CER, of 9.7%). The resulting resource supports multiple use cases, from pretraining language models to corpus linguistics on historical languages and digital humanities applications.
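
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between a reference transcription and the recognized text, divided by the reference length. The function names `levenshtein` and `cer` are illustrative assumptions, and the abstract does not state the authors' exact evaluation protocol (for example, how the average is taken across manuscripts), so this should not be read as their implementation.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a CER of 9.7% corresponds to roughly one character-level edit for every ten reference characters.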

Details

Paper ID
lrec2026-main-560
Pages
pp. 7034-7045
BibKey
clrice-etal-2026-comma
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Thibault Clérice

  • Simon Gabay

  • Malamatenia Vlachou-Efsthatiou

  • Ariane Pinche

  • Benoît Sagot

Links