EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

Proceedings of Natural Scientific Language Processing (NSLP) @ LREC 2026

DOI:10.63317/4fpwcgmuqk8x

Abstract

Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
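The alignment idea in the abstract (pairing commented-out LaTeX text with nearby final text) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `split_blocks` and `candidate_pairs`, the one-block search window, and the similarity thresholds are all assumptions made for illustration. Character-level similarity from Python's `difflib` stands in for whatever alignment the paper actually uses, and the LLM-based filtering step is not reproduced. The sketch also treats a line as commented out only when `%` is its first non-blank character, so inline comments and escaped `\%` are ignored.

```python
import difflib
import re

# Assumption: a line is "commented out" only if '%' is its first non-blank character.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group consecutive lines into (is_comment, text) blocks."""
    blocks, kind, lines = [], None, []
    for raw in latex_source.splitlines():
        match = COMMENT_LINE.match(raw)
        is_comment = match is not None
        text = (match.group(1) if match else raw).strip()
        # A blank line, or a switch between comment and text, closes the block.
        if lines and (not raw.strip() or is_comment != kind):
            blocks.append((kind, " ".join(lines)))
            lines = []
        if text:
            kind = is_comment
            lines.append(text)
    if lines:
        blocks.append((kind, " ".join(lines)))
    return blocks


def candidate_pairs(blocks, window=1, min_ratio=0.3, max_ratio=0.95):
    """Pair each comment block with its most similar nearby non-comment block.

    The thresholds are illustrative guesses: pairs that are too dissimilar are
    probably unrelated notes, and near-identical pairs carry little revision signal.
    """
    pairs = []
    for i, (is_comment, old_text) in enumerate(blocks):
        if not is_comment:
            continue
        best_ratio, best_text = 0.0, None
        lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
        for neighbor_is_comment, new_text in blocks[lo:hi]:
            if neighbor_is_comment:
                continue
            ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            if ratio > best_ratio:
                best_ratio, best_text = ratio, new_text
        if best_text is not None and min_ratio <= best_ratio <= max_ratio:
            pairs.append((old_text, best_text))
    return pairs


if __name__ == "__main__":
    sample = (
        "We propose a new method for parsing.\n"
        "% We propose a novel method for parsing text.\n"
        "\n"
        "% TODO: fix figure\n"
        "Results are shown in Table 1.\n"
    )
    for old, new in candidate_pairs(split_blocks(sample)):
        print("draft:", old)
        print("final:", new)
```

In this toy input, the commented-out sentence is paired with the sentence above it, while the `TODO` note is dropped because it resembles no neighbouring text. A real pipeline would still need a validation step, such as the LLM-based filtering the abstract mentions, to separate genuine revisions from notes and disabled content.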

Details

Paper ID: lrec2026-ws-nslp-12
Pages: pp. 119-126
BibKey: jourdan-etal-2026-earlyscirev
Editors: Georg Rehm, Stefan Dietze, Danilo Dessi, Diana Maynard, Sonja Schimmler
Publisher: European Language Resources Association (ELRA)
ISSN: N/A
ISBN: N/A
Workshop: Proceedings of Natural Scientific Language Processing (NSLP) @ LREC 2026
Location: Palma, Mallorca, Spain
Date: 11 - 16 May 2026

Authors

  • Léane Jourdan
  • Julien Aubert-Béduchaud
  • Yannis Chupin
  • Marah Baccari
  • Florian Boudin
