Adapting Pretrained Models to Endangered Languages in Japan: A Comparative Study on Ryukyuan and Ainu Speech Recognition
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We investigate high-accuracy and speaker-robust automatic speech recognition (ASR) models that leverage pretrained models for endangered languages in Japan — Ryukyuan (Shuri dialect) and Ainu (Saru dialect) — to support language and cultural preservation. In particular, we present the first experimental study on building and evaluating an ASR model for the Ryukyuan language. Specifically, we compare existing multilingual pretrained models, Whisper and XLS-R, with our in-house Japanese-focused model (JP-90k), pretrained solely on a large-scale weakly supervised Japanese dataset. These models were fine-tuned on up to 10 and 32 hours of Ryukyuan and Ainu data, respectively. JP-90k consistently outperformed other models of similar size in both languages. In addition, it demonstrated a remarkable advantage when training data was very limited, i.e., an hour or less. These findings suggest that large-scale pretraining on a language closely related to the target languages can yield robust low-resource ASR, including for unseen speakers and out-of-domain conditions. Furthermore, we found that all pretrained models converged in ASR accuracy with as little as 3–5 hours of fine-tuning data for both languages.