Leveraging Linguistic Similarity for Low-Resource Speech Transcription
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This study investigates how large-scale, self-supervised acoustic models such as XLSR and MMS represent linguistic similarity, and whether this similarity can be exploited to improve Automatic Speech Recognition (ASR) for low-resource and dialectally diverse languages. While these models excel at cross-lingual transfer learning, their internal representations of fine-grained dialectal variation remain opaque. We focus on Yiddish, a language with a complex dialect continuum, to test whether a model's internal acoustic similarity metric, Acoustic Token Distribution Similarity (ATDS), predicts ASR performance. We fine-tune models on Yiddish dialects and measure ATDS between Yiddish and related languages. Results confirm that ATDS is a meaningful predictor: higher acoustic similarity in the model's latent space correlates with a lower character error rate (CER) after fine-tuning. This relationship is strongest in the mid-to-upper layers of the MMS model and for in-domain data. Crucially, ATDS captures model-dependent acoustic similarity, which does not always align with genealogical linguistic relationships but remains a practical indicator of transfer-learning potential. We conclude that ATDS is a valuable tool for selecting donor languages when developing efficient, dialect-sensitive ASR systems for language documentation, even though its absolute values require careful interpretation against linguistic knowledge.
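
Since the abstract names ATDS but does not define it, the following is a minimal sketch, assuming ATDS is the cosine similarity between the unigram distributions of discretized acoustic tokens (e.g., k-means cluster IDs of hidden states from a chosen MMS layer). The names token_distribution, atds, and vocab_size are illustrative assumptions, not the paper's own implementation.

    import numpy as np

    def token_distribution(token_ids, vocab_size):
        # Unigram distribution over discretized acoustic tokens.
        counts = np.bincount(np.asarray(token_ids), minlength=vocab_size).astype(float)
        return counts / counts.sum()

    def atds(tokens_a, tokens_b, vocab_size=1024):
        # Hypothetical ATDS: cosine similarity between the two languages'
        # acoustic-token unigram distributions; the paper's exact
        # formulation may differ.
        p = token_distribution(tokens_a, vocab_size)
        q = token_distribution(tokens_b, vocab_size)
        return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

    # Toy usage with synthetic token IDs standing in for real
    # discretized speech from two languages.
    rng = np.random.default_rng(0)
    tokens_yiddish = rng.integers(0, 1024, size=50_000)
    tokens_german = rng.integers(0, 1024, size=50_000)
    print(f"ATDS(Yiddish, German) = {atds(tokens_yiddish, tokens_german):.3f}")

Under this assumption, a higher ATDS for a candidate donor language would predict a lower post-fine-tuning CER, which is the correlation the study reports.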