Adaptive Method for Self-Supervised Learning Models on Automatic Dialect Speech Recognition Based on Shared Knowledge of Japanese Dialects and Standard Japanese
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Speech recognition for Japanese dialects is challenging, and recognition accuracy tends to be lower than for standard Japanese. Previous research proposed a three-step training method that uses the self-supervised learning (SSL) model XLS-R as the base model and combines three tasks in multi-task learning: SSL, ASR, and dialect identification (DID). While this method improved recognition performance for dialect speech, it degraded recognition performance for standard Japanese. Building on that prior model, this study proposes an adaptation method for constructing a single speech recognition model suited to both Japanese dialects and standard Japanese. Aiming at knowledge sharing between dialects and standard Japanese, we explored the use of diverse speech corpora: in addition to the standard Japanese speech corpus CSJ and the dialect speech corpus COJADS used in the prior research, we used ReazonSpeech, built from TV broadcast audio, and CEJC, built from everyday conversational speech. As a result, we confirmed improved recognition performance for both dialects and standard Japanese when both are included in the final step of the three-step training method. We also examined how differences in corpus type and domain affect recognition performance.
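As a minimal illustrative sketch (not the paper's implementation), the multi-task setup described above can be thought of as optimizing a weighted sum of the three task losses; the weight values and function names here are assumptions for illustration only.

```python
# Hypothetical sketch of combining the three multi-task objectives
# named in the abstract -- SSL, ASR, and dialect identification (DID) --
# into one training loss. Weights are illustrative assumptions.

def multitask_loss(loss_ssl: float, loss_asr: float, loss_did: float,
                   w_ssl: float = 1.0, w_asr: float = 1.0,
                   w_did: float = 0.5) -> float:
    """Weighted sum of the SSL, ASR, and DID task losses."""
    return w_ssl * loss_ssl + w_asr * loss_asr + w_did * loss_did

# Example: SSL and ASR weighted equally, DID down-weighted.
total = multitask_loss(loss_ssl=0.8, loss_asr=1.2, loss_did=0.4)
```

In practice the relative weights control how strongly the auxiliary tasks (SSL, DID) shape the shared encoder during each training step.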