Back to Main Conference 2022
LREC 2022main

A Language Modelling Approach to Quality Assessment of OCR’ed Historical Text

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/3kd8n7srb9vx

Abstract

We hypothesise and evaluate a language model-based approach for scoring the quality of OCR transcriptions in the British Library Newspapers (BLN) corpus parts 1 and 2, to identify the best quality OCR for use in further natural language processing tasks, with a wider view to link individual newspaper reports of crime in nineteenth-century London to the Digital Panopticon—a structured repository of criminal lives. We mitigate the absence of gold standard transcriptions of the BLN corpus by utilising a corpus of genre-adjacent texts that capture the common and legal parlance of nineteenth-century London—the Proceedings of the Old Bailey Online—with a view to rank the BLN transcriptions by their OCR quality.

Details

Paper ID
lrec2022-main-630
Pages
pp. 5859-5864
BibKey
booth-etal-2022-language
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 25 June 2022

Authors

  • CB

    Callum Booth

  • RS

    Robert Shoemaker

  • RG

    Robert Gaizauskas

Links