A Dataset for Evaluating ASR on Specialized Vocabulary
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Evaluating the ability of Automatic Speech Recognition (ASR) models to transcribe specialized vocabulary remains a persistent challenge, as standard datasets predominantly feature common words and thus obscure weaknesses on rare or out-of-vocabulary (OOV) terms. To address this limitation, we introduce a linguistically curated bilingual dataset (English and Portuguese) comprising 13,846 utterances (18.7 hours), distributed across synthetic and literature-derived subsets, with OOV rates of up to 100%. We further propose a diagnostic evaluation framework that partitions recognition performance into a Biased Word Error Rate (B-WER), computed over domain-specific jargon, and an Unbiased Word Error Rate (U-WER), computed over general vocabulary. Baseline evaluations with Whisper models (medium, large-v3, and large-v3-turbo) confirm the necessity of this framework: on the most challenging subsets, B-WER reaches 0.88–0.90 while U-WER remains between 0.06 and 0.19, demonstrating that conventional WER masks critical failure modes in jargon recognition. An oracle upper-bound experiment further shows that supplying the correct jargon via prompting reduces B-WER by 0.50–0.70 absolute, quantifying the considerable headroom for contextual biasing. We release the datasets and evaluation scripts as a reproducible benchmark to foster research on domain-aware contextual biasing and OOV handling in ASR systems.
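To make the metric split concrete, the following is a minimal sketch of the B-WER/U-WER decomposition, assuming the convention common in the contextual-biasing literature: substitution and deletion errors are attributed to the class (biased vs. unbiased) of the reference word, and insertions to the class of the inserted hypothesis word. The function names and the toy example are illustrative; they do not correspond to the released evaluation scripts.

```python
def align(ref, hyp):
    """Levenshtein-align two token lists; return (op, ref_word, hyp_word) triples."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("sub" if ref[i - 1] != hyp[j - 1] else "ok",
                        ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

def b_u_wer(ref, hyp, bias_words):
    """Return (B-WER, U-WER) for one reference/hypothesis pair."""
    errors = {"b": 0, "u": 0}
    totals = {"b": sum(w in bias_words for w in ref)}
    totals["u"] = len(ref) - totals["b"]
    for op, r, h in align(ref, hyp):
        if op in ("sub", "del"):
            errors["b" if r in bias_words else "u"] += 1   # charged to reference class
        elif op == "ins":
            errors["b" if h in bias_words else "u"] += 1   # charged to inserted word
    return (errors["b"] / max(totals["b"], 1),
            errors["u"] / max(totals["u"], 1))

if __name__ == "__main__":
    # Hypothetical example: one jargon term among common words.
    ref = "the patient shows signs of pneumothorax".split()
    hyp = "the patient shows signs of new thorax".split()
    bwer, uwer = b_u_wer(ref, hyp, bias_words={"pneumothorax"})
    print(f"B-WER={bwer:.2f}  U-WER={uwer:.2f}")  # B-WER=1.00  U-WER=0.20
```

In this toy pair, the single jargon term is fully misrecognized (B-WER = 1.00) while the surrounding common words are nearly intact (U-WER = 0.20), mirroring at utterance scale the B-WER/U-WER gap the abstract reports at corpus scale.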