
eSciBench: An Extensible Scientific PDF Extraction Benchmark

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4sxku4i2piqq

Abstract

Automatically extracting information from PDF documents (such as authors, affiliations, references, tables, and equations) may be transformative in Digital Humanities, where the metadata accompanying a document is typically collected manually, a cumbersome process. In this work, we systematically benchmark PDF extractors on a set of 100 scientific articles (1949 pages) from the STEM domain that have been processed automatically and then carefully curated. Our benchmark, named eSciBench, is openly accessible. Testing 13 extractors on it reveals that, although some extractors perform well overall, extracting information from scientific articles is far from a solved problem.

Details

Paper ID: lrec2026-main-600
Pages: pp. 7568-7580
BibKey: taillon-etal-2026-escibench
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: 978-2-493814-49-4
Conference: The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location: Palma, Mallorca, Spain
Date: 11-16 May 2026

Authors

  • Noah Tremblay Taillon

  • Phillippe Langlais
