Domain-Specific Considerations in the Preparation of Specialized Corpora: A Case Study on a Corpus of German Sermons

Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

Abstract

We present a new corpus of contemporary German sermons and describe the steps taken in its preparation. We apply a semi-automatic approach to sentence segmentation, tokenization, and lemmatization, using annotation guidelines specialized to this domain. In preparing these data, we find that state-of-the-art tools for these tasks still make problematic errors, especially on non-standard data, despite apparently very high performance on common benchmarks. With our domain-adapted models, we obtain test scores of F1 = 96.69 % for sentence segmentation, F1 = 99.99 % for tokenization, and acc = 64.00 % for lemmatization, and we show that domain adaptation improves performance over state-of-the-art models on the tokenization and sentence segmentation tasks.

Details

Paper ID

lrec2026-ws-slide-19

Pages

pp. 212-223

DOI

10.63317/5nhkpqtxbr6i

BibKey

haiber-etal-2026-domain

Editors

Erhard Hinrichs (Tübingen University, Germany), Joakim Nivre (Uppsala University, Sweden), Petya Osenova (Sofia University, Bulgaria), James Pustejovsky (Brandeis University, USA), Claus Zinn (Tübingen University, Germany)

Publisher

European Language Resources Association (ELRA)

ISSN

N/A

ISBN

N/A

Workshop

Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

Location

Palma, Mallorca, Spain

Date

11 - 16 May 2026

Authors

Cora Haiber
Adam Roussel
Stefanie Dipper
