LREC 2014 Main

Named Entity Tagging a Very Large Unbalanced Corpus: Training and Evaluating NE Classifiers

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/4peap9ds2n8f

Abstract

We describe a systematic, application-oriented approach to training and evaluating named entity recognition and classification (NERC) systems. Its purpose is to identify an optimal system and to train an optimal model for named entity tagging of DeReKo, a very large general-purpose corpus of contemporary German (Kupietz et al., 2010). DeReKo's strong dispersion with respect to genre, register and time forces us to base our choice of a specific NERC system on an evaluation performed on a representative sample of DeReKo, rather than on the performance figures reported for the individual NERC systems when evaluated on more uniform and less diverse data. We create and manually annotate such a representative sample as evaluation data for three different NERC systems, for each of which various models are learnt on multiple training data sets. The proposed sampling method can be viewed as a generally applicable method for sampling evaluation data from an unbalanced target corpus for any kind of natural language processing task.
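The abstract does not spell out the sampling procedure, so the following is only an illustrative sketch of one common way to draw a representative sample from an unbalanced corpus: proportional stratified sampling with largest-remainder rounding. It is not the authors' method. The function name `stratified_sample`, the `genre` metadata field, and the toy corpus are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(docs, key, n, seed=0):
    """Draw n documents so that each stratum keeps roughly its corpus share.

    Assumption: this is generic proportional stratified sampling,
    not the procedure used in the paper.
    """
    if n > len(docs):
        raise ValueError("sample size exceeds corpus size")
    rng = random.Random(seed)

    # Group documents by stratum (e.g. genre, register, or time period).
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)

    # Ideal (fractional) quota per stratum, then round down.
    total = len(docs)
    quotas = {s: n * len(members) / total for s, members in strata.items()}
    alloc = {s: int(q) for s, q in quotas.items()}

    # Largest-remainder rounding: give leftover slots to the strata
    # whose quotas lost the most when rounded down.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1

    # Sample without replacement inside each stratum.
    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, min(alloc[s], len(members))))
    return sample


# Hypothetical toy corpus: 60 newspaper, 30 fiction, 10 science documents.
toy = (
    [{"id": i, "genre": "newspaper"} for i in range(60)]
    + [{"id": i, "genre": "fiction"} for i in range(60, 90)]
    + [{"id": i, "genre": "science"} for i in range(90, 100)]
)
sample = stratified_sample(toy, key=lambda d: d["genre"], n=10)
genre_counts = Counter(d["genre"] for d in sample)
```

In this toy setup the sample keeps the 6:3:1 genre proportions of the corpus. For a corpus dispersed over several dimensions, the key could return a tuple such as (genre, register, decade), again assuming such metadata is available.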

Details

Paper ID
lrec2014-main-728
Pages
N/A
BibKey
bingel-haider-2014-named
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 to 31 May 2014

Authors

  • Joachim Bingel

  • Thomas Haider
