LREC 2026 Workshop

Domain-Specific Considerations in the Preparation of Specialized Corpora: A Case Study on a Corpus of German Sermons

Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

DOI:10.63317/5nhkpqtxbr6i

Abstract

We present a new corpus of contemporary German sermons and describe the steps taken in its preparation. We apply a semi-automatic approach to sentence segmentation, tokenization, and lemmatization, utilizing annotation guidelines that are specialized to this domain. In the process of preparing these data, we find that state-of-the-art tools for these tasks still make problematic errors, especially with non-standard data, despite apparently very high performance on common benchmarks. We obtain test scores of F1 = 96.69 % for sentence segmentation, F1 = 99.99 % for tokenization, and acc = 64.00 % for lemmatization with our domain-adapted models and show that domain-adaptation improves performance over state-of-the-art models for the token and sentence segmentation tasks.

Details

Paper ID
lrec2026-ws-slide-19
Pages
pp. 212-223
BibKey
haiber-etal-2026-domain
Editors
Erhard Hinrichs (Tübingen University, Germany), Joakim Nivre (Uppsala University, Sweden), Petya Osenova (Sofia University, Bulgaria), James Pustejovsky (Brandeis University, USA), Claus Zinn (Tübingen University, Germany)
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Cora Haiber
  • Adam Roussel
  • Stefanie Dipper

Links