Investigating the Role of Synthetic Data Augmentation and Training Strategies on Improving Low-Resource Language ASR
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Low-resource automatic speech recognition (ASR) is challenging because annotated data is scarce. While synthetic data from text-to-speech (TTS) systems can augment ASR training, its efficacy for low-resource languages remains unclear. In this study, we investigate the conditions under which TTS-based data augmentation is most effective for low-resource languages. Experiments on six low-resource languages in Common Voice show that synthetic data is most beneficial under extremely low-resource ASR conditions (i.e., less than one hour of available real speech data), or for languages with larger amounts of TTS data (i.e., more than 10 hours). Additionally, increasing the amount and diversity of synthetic data while maintaining an appropriate ratio of synthetic-to-real data can further improve ASR performance.