Back to Main Conference 2012
LREC 2012main

Automatic Annotation and Manual Evaluation of the Diachronic German Corpus TüBa-D/DC

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012)

DOI:10.63317/4rsiajqwz8sg

Abstract

This paper presents the Tübingen Baumbank des Deutschen Diachron (TüBa-D/DC), a linguistically annotated corpus of selected diachronic materials from the German Gutenberg Project. It was automatically annotated by a suite of NLP tools integrated into WebLicht, the linguistic chaining tool used in CLARIN-D. The annotation quality has been evaluated manually for a subcorpus ranging from Middle High German to Modern High German. The integration of the TüBa-D/DC into the CLARIN-D infrastructure includes metadata provision and harvesting as well as sustainable data storage in the Tübingen CLARIN-D center. The paper further provides an overview of the possibilities of accessing the TüBa-D/DC data. Methods for full-text search of the metadata and object data and for annotation-based search of the object data are described in detail. The WebLicht Service Oriented Architecture is used as an integrated environment for annotation based search of the TüBa-D/DC. WebLicht thus not only serves as the annotation platform for the TüBa-D/DC, but also as a generic user interface for accessing and visualizing it.

Details

Paper ID
lrec2012-main-033
Pages
pp. 1622-1627
BibKey
hinrichs-zastrow-2012-automatic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-7-7
Conference
Eighth International Conference on Language Resources and Evaluation
Location
Istanbul, Turkey
Date
21 May 2012 27 May 2012

Authors

  • EH

    Erhard Hinrichs

  • TZ

    Thomas Zastrow

Links