eSciBench: An Extensible Scientific PDF Extraction Benchmark
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Automatically extracting information from PDF documents (such as authors, affiliations, references, tables, and equations) can be transformative in the Digital Humanities, where the metadata accompanying a document is typically collected manually, a cumbersome process. In this work, we systematically benchmark PDF extractors on a set of 100 scientific articles (1,949 pages) from the STEM domain that were processed automatically and then carefully curated. Our benchmark, named eSciBench, is openly accessible. Evaluating 13 extractors on it reveals that although some perform well overall, extracting information from scientific articles is far from a solved problem.