PressMint-PT - Compiling a Portuguese Historical Newspaper Corpus
Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers
Abstract
We present a new European Portuguese corpus of newspapers from the 19th and early 20th centuries, integrated into the recent PressMint project, whose goal is to provide a set of comparable newspaper corpora for European languages in that time frame. We discuss the raw data that were previously available, the new data compiled specifically for the project, and the challenges posed by OCR, text recognition, and differing orthographic norms. We describe the pipeline setup for XML encoding and annotation, which is partially based on work developed for the ParlaMint corpora. The corpus is currently under development and will be made freely available at the end of the project as part of the PressMint corpora.