LREC 2026 workshop

PressMint-PT - Compiling a Portuguese Historical Newspaper Corpus

Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers

DOI:10.63317/2os9smose6kj

Abstract

We present a new European Portuguese corpus of newspapers from the 19th and early 20th centuries, integrated in the recent PressMint project, whose goal is to provide a set of comparable newspaper corpora for European languages in that time frame. We discuss the raw data that was previously available, as well as new data specifically compiled for the project, and the challenges involving OCR, text recognition, and different orthographical norms. We describe the pipeline setup for XML encoding and annotation, partially based on work developed for the ParlaMint corpora. The corpus is currently under development and will be made freely available at the end of the project, as part of the PressMint corpora.

Details

Paper ID: lrec2026-ws-pressmint-04
Pages: pp. 16-20
BibKey: aires-etal-2026-pressmint
Editors: Maciej Ogrodniczuk, Petya Osenova, Tanja Wissik
Publisher: European Language Resources Association (ELRA)
ISSN: N/A
ISBN: N/A
Workshop: Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers
Location: Palma, Mallorca, Spain
Date: 11-16 May 2026

Authors

  • Jose Aires
  • Amália Mendes
