LREC 2026 workshop
Towards a Bulgarian Historical Newspaper Corpus – Construction of Reading Order over the Text in Searchable PDFs
Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers
Abstract
Determining the reading order of the text extracted from a searchable PDF, produced by OCR software from an old newspaper, is the first task in preparing corpora of old newspapers. In this paper we present an algorithm that generates a reading order for the blocks selected from the corresponding PDF. We also tuned the parameters of the algorithm; the optimization provides a 10% improvement.