LREC 2026 workshop
Sociolinguistic aspects of crowdsourcing for a vocal corpus of Alsatian
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Abstract
Alsatian is a low-resource regional language spoken in a majority-language context. To create a voice dataset suited for training automatic speech recognition and speech-to-text models, we launched a crowdsourcing campaign on the Mozilla Common Voice platform. We describe the sociolinguistic issues we ran into, such as participants’ perception of their own language and its role in the AI landscape, which are vital to address in order to increase participation in the crowdsourcing effort. We found that participants are often confused about NLP and AI tools and have a strong interest in preserving their language.