When Does OmniASR Fail? A Fine-Grained Human Evaluation on Saudi Arabic Dialects

Proceedings of Speech Language Models in Low-Resource Settings: Performance, Evaluation, and Bias Analysis (SPEAKABLE) @ LREC 2026

DOI:10.63317/2wmxhnwmucq4

Abstract

Automatic Speech Recognition (ASR) evaluation has traditionally relied on Word Error Rate (WER), a metric that treats all errors equally and obscures critical failure modes. In this paper, we present a fine-grained human evaluation of Meta’s recently released OmniASR system on Saudi Arabic dialects using the SADA dataset. Three trained annotators evaluated 103 audio samples, producing 264 annotations across two dimensions (comprehensibility and naturalness) while categorizing errors using a novel 10-category Arabic-specific error taxonomy. OmniASR achieved a mean WER of 42.2% and mean comprehensibility of 3.62/5, but exhibited a bimodal performance pattern: 32.6% of transcriptions achieved perfect scores while 21.2% were essentially unusable. Error analysis reveals that hallucinations and deletions have the greatest negative impact on comprehensibility (−1.64 and −1.57 points respectively), roughly 6× more damaging than named entity errors. Importantly, WER correlates only moderately with human comprehensibility ratings (r = −0.679), explaining just 46% of variance in human judgments. These findings demonstrate the limitations of WER as a sole evaluation metric and highlight the need for human-centered, error-type-aware evaluation frameworks for Arabic ASR systems.
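To make the two headline quantities concrete, the following is a minimal sketch of how WER is conventionally computed (word-level Levenshtein distance divided by reference length) and of how a Pearson correlation of r = −0.679 translates into roughly 46% explained variance. The function name `word_error_rate` and the toy token strings are illustrative assumptions; this is not the evaluation code used in the study, and it applies no text normalization, which real Arabic ASR scoring would typically need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (sub + del + ins) divided by reference length.

    Hypothetical helper for illustration; whitespace tokenization only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one substitution ("b" -> "x") and one deletion ("d").
toy_wer = word_error_rate("a b c d", "a x c")

# Share of variance in comprehensibility ratings explained by WER,
# using the correlation reported in the abstract.
r = -0.679
r_squared = r ** 2  # about 0.46

print(f"toy WER = {toy_wer:.2f}, r^2 = {r_squared:.2f}")
```

Note that WER weights every substitution, deletion, and insertion identically, which is why two transcriptions with the same WER can differ sharply in comprehensibility.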