Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

DOI:10.63317/2xvbjjktadtu

Abstract

Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text from digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we analyzed both the textual and speech components to assess the dataset’s characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations and serves as a foundation for future research on dialect-aware text-to-speech (TTS), for which we propose zero-shot and fine-tuned model adaptation in low-resource scenarios.

Details

Paper ID: lrec2026-ws-dialres-02
Pages: pp. 12-23
BibKey: oberkircher-etal-2026-saar
Editors: Antonis Anastasopoulos, Stella Markantonatou, Angela Ralli, Marcos Zampieri, Stavros Bompolas, Vivian Stamou
Publisher: European Language Resources Association (ELRA)
ISSN: N/A
ISBN: N/A
Workshop: Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Location: Palma, Mallorca, Spain
Date: 11 - 16 May 2026

Authors

  • Lena Sophie Oberkircher
  • Jesujoba Alabi
  • Dietrich Klakow
  • Jürgen Trouvain
