CLARIAH-ES PressMint: Building Interoperable Corpora of Historical Press in Spain
Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers
Abstract
This paper describes CLARIAH-ES’s contribution to PressMint in Spain as a distributed effort across regional nodes (e.g., Catalonia, Madrid, Basque Country, Galicia, Canary Islands, Alicante), each developing manageable corpora in partnership with key repositories such as ARCA, Patrimonio Digital Complutense, Euskariana, Jable, Galiciana, and the BVMC periodicals portal. A central technical challenge is the heterogeneous quality of legacy OCR, which motivates experiments with AI/LLM-assisted OCR renewal, normalization layers, and linguistic enrichment (e.g., NER and entity linking). This effort runs alongside ongoing dissemination activities and the EOSC Mesh "historical newspapers" use case, which aims at scalable discovery, access, and federated computation over interoperable historical press data.