LREC 2018 (main conference)

Correction of OCR Word Segmentation Errors in Articles from the ACL Collection through Neural Machine Translation Methods

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/2gn4wbpg9mkw

Abstract

Depending on the quality of the original document, Optical Character Recognition (OCR) can produce a range of errors -- from erroneous letters to additional and spurious blank spaces. We applied a sequence-to-sequence machine translation system to correct word-segmentation OCR errors in scientific texts from the ACL collection with an estimated precision and recall above 0.95 on test data. We present the correction process and results.

Details

Paper ID
lrec2018-main-113
Pages
N/A
BibKey
nastase-hitschler-2018-correction
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7 May 2018 to 12 May 2018

Authors

  • Vivi Nastase

  • Julian Hitschler

Links