LREC 2026 workshop

Gutenberg+: A More Temporally Faithful Corpus for Diachronic NLP

Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

DOI:10.63317/2kjofgrkkbt9

Abstract

We introduce Gutenberg+, a temporally more faithful version of the Project Gutenberg (PG) corpus, one of the most widely used resources for diachronic text analysis. Despite its popularity, the PG corpus contains a major yet overlooked flaw: around 15% of its entries are collections (e.g., anthologies of books, letters, or poems) rather than atomic works, which distorts temporal analyses since such collections may span multiple decades. We present an automatic method to detect and split these collections into their constituent works, producing a finer-grained and temporally consistent corpus. We further re-annotate publication years using LLM-based retrieval-augmented generative methods, demonstrating the potential of LLMs to enhance structured linguistic resources. To illustrate the utility of Gutenberg+, we conduct a small-scale diachronic case study on negation, showing that our refined corpus captures more nuanced cross-linguistic variation than the original PG data. Finally, we release the corpus in UIMA format with full metadata and linguistic annotations, providing a standardized resource for future research on diachronic language change.

Details

Paper ID
lrec2026-ws-slide-07
Pages
pp. 86-92
BibKey
hammerla-etal-2026-gutenberg
Editors
Erhard Hinrichs (Tübingen University, Germany), Joakim Nivre (Uppsala University, Sweden), Petya Osenova (Sofia University, Bulgaria), James Pustejovsky (Brandeis University, USA), Claus Zinn (Tübingen University, Germany)
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Leon Hammerla

  • Alexander Mehler