Correction of OCR Word Segmentation Errors in Articles from the ACL Collection through Neural Machine Translation Methods
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Abstract
Depending on the quality of the original document, Optical Character Recognition (OCR) can produce a range of errors -- from erroneous letters to additional, spurious blank spaces. We applied a sequence-to-sequence machine translation system to correct word-segmentation OCR errors in scientific texts from the ACL collection, achieving an estimated precision and recall above 0.95 on test data. We present the correction process and results.
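As an illustration only, the sketch below shows one way word-segmentation correction could be framed as a translation task and scored. It is not the system described in the paper. The helper names (`to_char_tokens`, `corrupt_spacing`, `boundary_precision_recall`), the `<sp>` space marker, the synthetic corruption rates, and the boundary-based definition of precision and recall are all assumptions made for this example.

```python
import random


def to_char_tokens(text):
    """Split text into character tokens, marking spaces with a visible token.

    A character-level view lets a sequence-to-sequence model learn where
    spaces should be inserted or removed. The "<sp>" marker is hypothetical.
    """
    return ["<sp>" if ch == " " else ch for ch in text]


def corrupt_spacing(text, p_insert=0.1, p_delete=0.1, rng=None):
    """Create a synthetic noisy source by inserting and deleting spaces.

    Only spacing changes; the non-space characters keep their order.
    The probabilities are arbitrary illustration values.
    """
    rng = rng or random.Random(0)
    out = []
    for i, ch in enumerate(text):
        if ch == " ":
            if rng.random() < p_delete:
                continue  # drop this space: two words get merged
            out.append(ch)
        else:
            out.append(ch)
            has_next_letter = i + 1 < len(text) and text[i + 1] != " "
            if has_next_letter and rng.random() < p_insert:
                out.append(" ")  # spurious space: a word gets split
    return "".join(out)


def space_boundaries(text):
    """Return the set of space positions, counted in non-space characters."""
    positions = set()
    n = 0
    for ch in text:
        if ch == " ":
            positions.add(n)
        else:
            n += 1
    return positions


def boundary_precision_recall(predicted, reference):
    """Score predicted word boundaries against reference boundaries.

    Assumed metric: a boundary is correct if the reference has a space
    at the same character offset.
    """
    pred = space_boundaries(predicted)
    ref = space_boundaries(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(ref) if ref else 1.0
    return precision, recall
```

Under these assumptions, a (noisy, clean) training pair would be `(to_char_tokens(corrupt_spacing(s)), to_char_tokens(s))` for a clean sentence `s`, and `boundary_precision_recall("machine trans lation", "machine translation")` returns `(0.5, 1.0)`: one of the two predicted boundaries is correct, and the single reference boundary is found.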