LREC 2026 Workshop

Bi-Text Mining across German Dialects: On the Role of Synthetic Training Data for Dialect Adaptation

Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)

DOI:10.63317/3gmqhegz45cn

Abstract

Cross-dialect bi-text mining relies on robust multilingual sentence representations to identify semantically equivalent sentence pairs across languages. While recent multilingual bi-encoder models achieve strong performance on standardized written languages, their behavior on dialectal varieties is largely unknown. In this study, we use Tatoeba to evaluate the performance of four widely used bi-encoders on dialect-to-standard German translation retrieval, covering German documents and queries written in three dialects: Low German, Bavarian, and Alemannic. Motivated by the lack of resources, we examine the extent to which synthetic translations (from dictionaries and large language models; LLMs) can serve as weak supervision for dialect adaptation. Our results reveal that bi-encoders, when applied in a zero-shot setting, exhibit deficiencies in capturing semantic similarity between German and dialects, while fine-tuning on synthetic data substantially improves their retrieval effectiveness, with larger gains obtained from LLM-translated training data. We further analyze retrieval performance on Bavarian across varying dialect word proportions and observe a drop when dialect words make up more than 60% of the text.

Details

Paper ID
lrec2026-ws-bucc-09
Pages
pp. 72-83
BibKey
wang-etal-2026-bi
Editors
Reinhard Rapp, Ayla Rigouts Terryn, Serge Sharoff, Pierre Zweigenbaum
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Jing Wang

  • Barbara Plank

  • Robert Litschko

Links