LREC 2026 workshop

Towards a Bulgarian Historical Newspaper Corpus – Construction of Reading Order over the Text in Searchable PDFs

Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers

DOI:10.63317/336sd7ixv3fw

Abstract

Determining the reading order of the text extracted from a searchable PDF, produced by OCR software from an old newspaper, is the first task in preparing corpora of old newspapers. In this paper we present an algorithm that generates the reading order of blocks selected from the corresponding PDF. We also tuned the parameters of the algorithm. The optimization provides a 10% improvement.

Details

Paper ID
lrec2026-ws-pressmint-10
Pages
pp. 56-64
BibKey
paev-etal-2026-bulgarian
Editors
Maciej Ogrodniczuk, Petya Osenova, Tanja Wissik
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Nikolay Paev
  • Stefan Marinov
  • Ivan Kratchanov
  • Petya Osenova
  • Kiril Simov
