LREC 2026 workshop

Systematic Normalization of Spoken Mixed-Language, Mixed-Dialect Data

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

DOI:10.63317/3bv9dmxr24p6

Abstract

Literary transcriptions of spoken language often deviate from standard, written language. These variations can lead to higher-than-desirable error rates in NLP processing. This is particularly the case for spoken data of low-resource varieties, including dialects and contact varieties of higher-resource languages. This paper outlines a proposal for the systematic dialect-to-standard normalization of spoken language from language contact and dialect contact situations. This system is then tested on the Texas German Sample Corpus (13 hours), a set of audio and transcripts of Texas German conversations. Texas German is an umbrella term for a set of heritage varieties of German spoken in Texas, USA, that descend from multiple German dialects and that have been in contact with English for 150+ years. The proposed normalization system, along with the accompanying language-tagging system, can act as a starting point for other projects interested in normalizing their mixed-variety data.

Details

Paper ID
lrec2026-ws-dialres-06
Pages
pp. 58-69
BibKey
blevins-2026-systematic
Editors
Antonis Anastasopoulos, Stella Markantonatou, Angela Ralli, Marcos Zampieri, Stavros Bompolas, Vivian Stamou
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Margaret Blevins
