Systematic Normalization of Spoken Mixed-Language, Mixed-Dialect Data

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

Abstract

Literary transcriptions of spoken language often deviate from the standard written language. These deviations can lead to higher-than-desirable error rates in NLP processing. This is particularly the case for spoken data from low-resource varieties, including dialects and contact varieties of higher-resource languages. This paper outlines a proposal for the systematic dialect-to-standard normalization of spoken language from language-contact and dialect-contact situations. The system is then tested on the Texas German Sample Corpus (13 hours), a set of audio recordings and transcripts of Texas German conversations. Texas German is an umbrella term for a set of heritage varieties of German spoken in Texas, USA, that descend from multiple German dialects and have been in contact with English for 150+ years. The proposed normalization system, along with the accompanying language-tagging system, can serve as a starting point for other projects that need to normalize mixed-variety data.