
Phonetic-based Ranking for Improved Pseudo-Labeling in Low-Resource ASR

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/338dnb8n7e85

Abstract

The rise of large language models has boosted speech and language technologies; however, where transcribed audio data is limited, the performance of current technology is not yet satisfactory. One common strategy to tackle data scarcity is to leverage pseudo-labels, for example by automatically transcribing data with a pre-trained ASR model. A critical issue with this approach is assessing the quality of the automatic transcriptions, which may be poor for low-resource languages. While several filtering approaches exist in the literature, they typically work with decent pre-trained ASR models but may fail otherwise. In this work, we propose a phonetic-based ranking that enables effective selection with controllable computational resources; the resulting subset of pseudo-labels serves as additional material for fine-tuning the source ASR models. Experiments on common benchmarks in three low-resource languages demonstrate the effectiveness of the proposed approach, yielding up to a 3-point reduction in word error rate (WER).
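The abstract does not spell out how the phonetic ranking is computed, so the following is only a minimal sketch of the general rank-and-select idea under stated assumptions. It assumes each pseudo-labeled utterance comes with two phone sequences: one derived from the ASR transcript and one decoded directly from the audio. It scores their disagreement with a phone error rate and keeps the best-scoring fraction. The function names, the scoring rule, and the `keep_fraction` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical record: (utterance id, phones from transcript, phones from audio)
Candidate = Tuple[str, Sequence[str], Sequence[str]]


def select_pseudo_labels(candidates: List[Candidate], keep_fraction: float) -> List[str]:
    """Rank utterances by phonetic disagreement and keep the best fraction.

    Assumption: a lower phone error rate between the two phone sequences
    indicates a more reliable pseudo-label.
    """
    scored = [
        (phone_error_rate(audio_phones, text_phones), utt_id)
        for utt_id, text_phones, audio_phones in candidates
    ]
    scored.sort()  # lowest disagreement first; ties broken by utterance id
    n_keep = int(len(scored) * keep_fraction)
    return [utt_id for _, utt_id in scored[:n_keep]]
```

In a sketch like this, `keep_fraction` is the knob that trades the amount of added fine-tuning material against its expected quality; the selected utterance ids would then identify the pseudo-labels passed on to fine-tuning.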

Details

Paper ID
lrec2026-main-795
Pages
pp. 10130-10139
BibKey
matassoni-etal-2026-phonetic
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Marco Matassoni

  • Roberto Gretter

  • Daniele Falavigna

  • Mohamed Nabih Ali Mohamed Nawar

  • Alessio Brutti

  • Matteo Negri

  • Mauro Cettolo

  • Marco Gaido

  • Sara Papi

  • Luisa Bentivogli
