Domain-Specific Considerations in the Preparation of Specialized Corpora: A Case Study on a Corpus of German Sermons
Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)
Abstract
We present a new corpus of contemporary German sermons and describe the steps taken in its preparation. We apply a semi-automatic approach to sentence segmentation, tokenization, and lemmatization, using annotation guidelines tailored to this domain. In preparing these data, we find that state-of-the-art tools for these tasks still make problematic errors, especially on non-standard data, despite their very high reported performance on common benchmarks. With our domain-adapted models, we obtain test scores of F1 = 96.69 % for sentence segmentation, F1 = 99.99 % for tokenization, and acc = 64.00 % for lemmatization, and we show that domain adaptation improves performance over state-of-the-art models on the tokenization and sentence segmentation tasks.
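For readers unfamiliar with the metrics reported above, the following minimal sketch shows one common way to compute a boundary-level F1 score for segmentation and a token-level accuracy for lemmatization. It is an illustration under stated assumptions, not the evaluation procedure used in this work: the function names `boundary_f1` and `lemma_accuracy`, the representation of boundaries as character offsets, and the toy inputs are all hypothetical.

```python
def boundary_f1(gold, predicted):
    """F1 over boundary positions (assumed here to be character offsets)."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


def lemma_accuracy(gold_lemmas, predicted_lemmas):
    """Share of tokens whose predicted lemma matches the gold lemma exactly."""
    if len(gold_lemmas) != len(predicted_lemmas):
        raise ValueError("gold and predicted sequences must be aligned")
    if not gold_lemmas:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_lemmas, predicted_lemmas))
    return correct / len(gold_lemmas)


# Toy data, invented for illustration only.
segmentation_score = boundary_f1([10, 25, 40], [10, 25, 38])
lemma_score = lemma_accuracy(["sein", "Gott", "gut"], ["sein", "Gott", "gute"])
```

In this toy example, two of three predicted boundaries and two of three lemmas match the gold annotation, so both scores equal 2/3. Note that lemma accuracy in this form presupposes identical tokenization in the gold and predicted sequences.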