Back to Main Conference 2014
LREC 2014main

NoSta-D Named Entity Annotation for German: Guidelines and Dataset

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/4kdtmaceeavd

Abstract

We describe the annotation of a new dataset for German Named Entity Recognition (NER). The need for this dataset is motivated by licensing issues and consistency issues of existing datasets. We describe our approach to creating annotation guidelines based on linguistic and semantic considerations, and how we iteratively refined and tested them in the early stages of annotation in order to arrive at the largest publicly available dataset for German NER, consisting of over 31,000 manually annotated sentences (over 591,000 tokens) from German Wikipedia and German online news. We provide a number of statistics on the dataset, which indicate its high quality, and discuss legal aspects of distributing the data as a compilation of citations. The data is released under the permissive CC-BY license, and will be fully available for download in September 2014 after it has been used for the GermEval 2014 shared task on NER. We further provide the full annotation guidelines and links to the annotation tool used for the creation of this resource.

Details

Paper ID
lrec2014-main-251
Pages
pp. 2524-2531
BibKey
benikova-etal-2014-nosta
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 31 May 2014

Authors

  • DB

    Darina Benikova

  • CB

    Chris Biemann

  • MR

    Marc Reznicek

Links