Disfluencies and ASR Performance on Swedish Spontaneous Speech from the ‘Trip to Stockholm’ Discourse Narrative Task
Proceedings of the Sixth Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments in cooperation with the MENTAL.ai consortium
Abstract
Automatic Speech Recognition (ASR) offers a scalable and cost-efficient alternative to manual transcription and is becoming increasingly relevant in clinical contexts, particularly for the detection of cognitive decline and for mental health assessment. However, current ASR systems still struggle with spontaneous speech, especially with the disfluencies, pauses, and speaker variability that often carry diagnostic value. This study evaluates state-of-the-art open ASR models targeting Swedish, using recordings from the "Trip to Stockholm" discourse narrative task, which elicits ecologically valid, cognitively demanding speech. Recognition quality is assessed using several metrics, alongside an analysis of linguistic and technical sources of error that focuses on disfluencies. Our findings show that disfluency-related phenomena degrade recognition performance, and that post-processing strategies can mitigate specific error patterns that emerge for filled pauses, word repetitions, and self-corrections. The results illustrate both the advances and the ongoing limitations of ASR for spontaneous Swedish speech, emphasizing the need for models explicitly trained or fine-tuned on disfluent data to ensure robustness in clinical and research applications.
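To make the kind of evaluation described above concrete, the following is a minimal sketch of how word error rate (WER) could be computed with and without a simple disfluency normalization step. It is not the procedure used in the study: the filler inventory `FILLED_PAUSES`, the helper names, and the example utterance are all hypothetical, and the abstract does not specify which metrics or post-processing rules were applied.

```python
import re

# Hypothetical inventory of Swedish filled pauses (an assumption for
# illustration; the study's actual list is not given in the abstract).
FILLED_PAUSES = {"eh", "öh", "äh", "hm", "mm"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def strip_disfluencies(tokens):
    """Drop filled pauses and collapse immediately repeated words."""
    out = []
    for tok in tokens:
        if tok in FILLED_PAUSES:
            continue  # remove filled pause
        if out and out[-1] == tok:
            continue  # collapse an immediate repetition
        out.append(tok)
    return out


def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented example: a verbatim reference and an ASR output that silently
# dropped the filled pause and both repetitions.
reference = tokenize("jag eh jag åkte till till Stockholm")
hypothesis = tokenize("jag åkte till Stockholm")

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(strip_disfluencies(reference),
                     strip_disfluencies(hypothesis))
```

Comparing `raw_wer` with `normalized_wer` separates errors caused by omitted disfluencies from other recognition errors. Note that this naive rule also collapses intentional repetitions, so it would be unsuitable wherever repetitions themselves are the diagnostic signal of interest.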