
PressMint: Towards Interoperable Corpora of Historical Newspapers

Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers

DOI: 10.63317/55w7588oukim

Abstract

This paper presents Project X (name anonymized for review), an ongoing initiative to compile a multilingual, comparable, annotated, translated, and interoperable collection of European historical newspaper corpora. Spanning 17 countries and covering 15 languages, the project addresses a key shortcoming of existing newspaper resources: their lack of interoperability, which limits cross-lingual and transnational research. Building on the infrastructure and experience of the ParlaMint projects, the project adapts established encoding guidelines, validation workflows, and open-source tools to historical newspaper data. We outline the overall project architecture, the corpus encoding scheme, and the GitHub-based framework supporting collaborative development and quality control. The paper further describes the sample linguistic annotation pipeline, including OCR correction, text normalisation, and annotation within the Universal Dependencies framework, with attention to challenges posed by historical language varieties. The resulting FAIR, openly available corpora are intended to support comparative, diachronic research across the humanities and social sciences.

Details

Paper ID
lrec2026-ws-pressmint-01
Pages
pp. 1-5
BibKey
erjavec-etal-2026-pressmint
Editors
Maciej Ogrodniczuk, Petya Osenova, Tanja Wissik
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Tomaž Erjavec
  • Matyáš Kopp
  • Maciej Ogrodniczuk
  • Petya Osenova
  • German Rigau

Links