Scientific Article Section Classification (SASC) Dataset

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3rdo9r4az3iy

Abstract

We introduce a novel, publicly available dataset of scientific publications designed for the structural and semantic analysis of their full texts. The collection comprises 4,896 scholarly articles, processed with GROBID and custom parsers for segmentation and section parsing. To ensure broad utility and diversity, the dataset includes ≈1,000 papers from each of four specialized research areas (Energy, Cancer, Neuroscience, and Transportation), supplemented by an additional ≈1,000 papers randomly selected from general scientific domains. The dataset is annotated using a newly defined two-level hierarchical taxonomy: the first level contains 9 coarse-grained semantic classes, and the second level contains 47 fine-grained semantic classes. All source documents were ethically and legally sourced via OpenAIRE, and the corpus is restricted exclusively to content available under open licenses. License verification was performed by cross-referencing publisher metadata, landing pages, and the Unpaywall database. This curated dataset provides a robust and domain-diverse resource, well suited to developing and evaluating NLP models that require training on the hierarchical structure of scientific literature.

Details

Paper ID
lrec2026-main-587
Pages
pp. 7415-7422
BibKey
duransilva-etal-2026-scientific
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Nicolau Duran-Silva
  • Julian Moreno-Schneider
  • César Parra-Rojas
  • Georg Rehm
