LREC 2026, main conference

A Dataset for Evaluating ASR on Specialized Vocabulary

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/568sthbwdhap

Abstract

Evaluating the ability of Automatic Speech Recognition (ASR) models to transcribe specialized vocabulary remains a persistent challenge, as standard datasets predominantly feature common words and thus obscure weaknesses on rare or out-of-vocabulary (OOV) terms. To address this limitation, we introduce a linguistically curated bilingual dataset (English and Portuguese) comprising 13,846 utterances (18.7 hours) distributed across synthetic and literature-derived subsets, with OOV rates reaching up to 100%. We further propose a diagnostic evaluation framework that partitions recognition performance into Biased Word Error Rate (B-WER), targeting domain-specific jargon, and Unbiased Word Error Rate (U-WER), focusing on general vocabulary. Baseline evaluations using Whisper models (medium, large-v3, and large-v3-turbo) confirm the necessity of this framework. On the most challenging datasets, B-WER reaches 0.88–0.90, whereas U-WER remains as low as 0.06–0.19, demonstrating that conventional WER masks critical failure modes in jargon recognition. Additionally, an oracle upper bound experiment shows that providing correct jargon via prompting reduces B-WER by 0.50–0.70 absolute, quantifying the considerable potential for contextual biasing. We release the datasets and evaluation scripts as a reproducible benchmark to foster research on domain-aware contextual biasing and OOV handling in ASR systems.
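The B-WER / U-WER split described above can be illustrated with a minimal sketch. It assumes one common formulation from the contextual biasing literature: after a word-level Levenshtein alignment, substitutions and deletions are attributed to the reference word, insertions to the inserted hypothesis word, and each error counts toward B-WER if that word is in the jargon list and toward U-WER otherwise. The abstract does not spell out the authors' exact definition, so the attribution rule, the function names `align` and `b_u_wer`, and the example sentence are all illustrative assumptions rather than the released evaluation scripts.

```python
from typing import List, Set, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str, str]]:
    """Word-level Levenshtein alignment.

    Returns a list of (op, ref_word, hyp_word) with op in
    {"ok", "sub", "del", "ins"}; the unused side is "".
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    # Backtrace from the bottom-right corner to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def b_u_wer(ref: List[str], hyp: List[str], jargon: Set[str]) -> Tuple[float, float]:
    """Return (B-WER, U-WER) under the attribution rule assumed above."""
    b_ref = sum(w in jargon for w in ref)   # jargon words in the reference
    u_ref = len(ref) - b_ref                # general-vocabulary words
    b_err = u_err = 0
    for op, r, h in align(ref, hyp):
        if op == "ok":
            continue
        word = h if op == "ins" else r      # insertions are judged by the hypothesis word
        if word in jargon:
            b_err += 1
        else:
            u_err += 1
    b_wer = b_err / b_ref if b_ref else 0.0
    u_wer = u_err / u_ref if u_ref else 0.0
    return b_wer, u_wer


# Hypothetical example: one jargon term is split into two wrong words.
ref = "the patient received amoxicillin today".split()
hyp = "the patient received a moxicillin today".split()
b_wer, u_wer = b_u_wer(ref, hyp, jargon={"amoxicillin"})
print(b_wer, u_wer)  # 1.0 0.25
```

In this toy case the conventional WER would be 2/5 = 0.4, while the split shows that the only jargon term was missed entirely (B-WER 1.0) and general vocabulary was barely affected (U-WER 0.25), which is the kind of masking effect the abstract reports.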

Details

Paper ID: lrec2026-main-032
Pages: pp. 470-480
BibKey: klering-etal-2026-dataset
Editor: N/A
Publisher: European Language Resources Association (ELRA)
ISSN: 2522-2686
ISBN: 978-2-493814-49-4
Conference: The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location: Palma, Mallorca, Spain
Date: 11 May 2026 to 16 May 2026

Authors

  • Emily Haubert Klering
  • Eduardo Gabriel Cortes
  • Tatjana Chernenko
  • Mariana Vargas Trarbach
  • Gabriel de Oliveira Ramos
  • Sandro José Rigo
  • Maitê Dupont
  • Ana Luiza Treichel Vianna
  • Gabriela Krause dos Santos
  • Vinicius Meirelles Pereira
  • Denis Andrei de Araujo
  • Rafael Kunst
