SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/572p7c4fkctw

Abstract

This paper presents SciCorp, a corpus of full-text English scientific papers from two disciplines, genetics and computational linguistics. The corpus is annotated with co-reference and bridging information as well as information status labels. Because SciCorp contains both the labels and the corresponding co-reference and bridging links, we believe it is a valuable resource for NLP researchers working on scientific articles or on applications such as co-reference resolution, bridging resolution, or information status classification. The corpus has been reliably annotated by independent human coders with moderate inter-annotator agreement (average kappa = 0.71). In total, we have annotated 14 full papers containing 61,045 tokens and marked 8,708 definite noun phrases. The paper describes the annotation scheme and the resulting corpus in detail. The corpus is available for download in two formats: an offset-based format and, for the co-reference annotations, the widely used tabular CoNLL-2012 format.
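For readers unfamiliar with the tabular CoNLL-2012 format mentioned above, the following is a minimal sketch of how co-reference chains might be read from it. It assumes only the general CoNLL-2012 convention that the last whitespace-separated column holds the co-reference annotation ("(1" opens a mention of chain 1, "1)" closes it, "(1)" is a single-token mention, "-" means none, and "|" separates several markers on one token). The function name coref_chains and the shortened sample rows are hypothetical; they are not taken from SciCorp, and real files carry more columns per token.

```python
from collections import defaultdict


def coref_chains(lines):
    """Collect (start, end) token spans per chain ID from CoNLL-2012-style rows.

    Token indices run over the whole input, not per sentence.
    Only the last column of each row is inspected.
    """
    chains = defaultdict(list)       # chain id -> list of (start, end) spans
    open_starts = defaultdict(list)  # chain id -> stack of open start indices
    token_idx = 0
    for line in lines:
        line = line.strip()
        # Skip sentence breaks and "#begin document" / "#end document" rows.
        if not line or line.startswith("#"):
            continue
        coref = line.split()[-1]
        if coref != "-":
            for part in coref.split("|"):
                chain_id = int(part.strip("()"))
                if part.startswith("("):
                    open_starts[chain_id].append(token_idx)
                if part.endswith(")"):
                    start = open_starts[chain_id].pop()
                    chains[chain_id].append((start, token_idx))
        token_idx += 1
    return dict(chains)


# Hypothetical, shortened rows (assumption: real rows have more columns).
sample = [
    "#begin document (demo); part 000",
    "demo 0 0 The (1",
    "demo 0 1 corpus 1)",
    "demo 0 2 is -",
    "demo 0 3 annotated -",
    "",
    "demo 0 0 It (1)",
    "demo 0 1 contains -",
    "demo 0 2 papers (2)",
    "#end document",
]

print(coref_chains(sample))
```

Each chain ID maps to the token spans of its mentions, so "The corpus" and "It" end up in the same chain in this toy input.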

Details

Paper ID
lrec2016-main-275
Pages
pp. 1743-1749
BibKey
roesiger-2016-scicorp
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23 May 2016 to 28 May 2016

Authors

  • Ina Roesiger

Links