Towards a Bulgarian Historical Newspaper Corpus – Construction of Reading Order over the Text in Searchable PDFs

Proceedings of the First Workshop on Creating Interoperable Corpora of Historical Newspapers

Abstract

Determining the reading order of the text extracted from a searchable PDF, produced by OCR software from an old newspaper, is the first task in preparing corpora of old newspapers. In this paper we present an algorithm for generating the reading order of blocks selected from the corresponding PDF. We also tuned the parameters of the algorithm. The optimization provides a 10% improvement.