COCOA: Creation and Exploratory Investigation of a COrpus of Claims frOm NLP Articles
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Research articles are an essential pillar of scientific knowledge, but they are subject to multiple constraints. On the one hand, their scientific reliability is essential and relies in particular on the peer review process. On the other hand, they fulfill a rhetorical function of persuasion for authors who defend claims in an increasingly competitive environment. In a context of rapidly growing publication volume and quickly evolving practices, it is essential that the scientific community remain alert and critical of its own biases. In this paper, we call for an "NLP for NLP" framing of these issues. We created COCOA, a corpus of sentences from NLP papers and pre-prints published in English between 1952 and 2024, a sample of which we manually annotated with claim category labels reflecting their rhetorical function. We fine-tuned a SciBERT model to predict the remaining labels, and made both the corpus and the model available to the community. We illustrate the usefulness of the corpus with exploratory analyses, and outline directions for further research. We hope that this work can stimulate discussion of research standardization and scientific overclaiming.