Phonetic-based Ranking for Improved Pseudo-Labeling in Low-Resource ASR
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
The rise of large language models has boosted speech and language technologies; however, where transcribed audio data are limited, the performance of current technology is not yet satisfactory. One common strategy to tackle data scarcity is to leverage pseudo-labels, for example by automatically transcribing data with a pre-trained ASR model. One critical issue with this approach is assessing the quality of the automatic transcriptions, which may be poor for low-resource languages. While several filtering approaches exist in the literature, they typically work well when the pre-trained ASR model is reasonably accurate but may fail otherwise. In this work, we propose a phonetic-based ranking that enables effective selection of pseudo-labels with controllable computational resources; the resulting subset of pseudo-labels serves as additional material for fine-tuning the source ASR models. Experiments on common benchmarks in three low-resource languages demonstrate the effectiveness of the proposed approach, yielding up to a 3-point reduction in word error rate (WER).