Systematic Normalization of Spoken Mixed-Language, Mixed-Dialect Data
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Abstract
Literary transcriptions of spoken language often deviate from the standard written language. These deviations can lead to higher-than-desirable error rates in NLP tasks. This is particularly the case for spoken data of low-resource varieties, including dialects and contact varieties of higher-resource languages. This paper outlines a proposal for the systematic dialect-to-standard normalization of spoken language from language contact and dialect contact situations. The system is then tested on the Texas German Sample Corpus (13 hours), a set of audio recordings and transcripts of Texas German conversations. Texas German is an umbrella term for a set of heritage varieties of German spoken in Texas, USA, that descend from multiple German dialects and have been in contact with English for more than 150 years. The proposed normalization system, along with the accompanying language-tagging system, can act as a starting point for other projects interested in normalizing their mixed-variety data.