LREC-COLING 2024 (main)

Samrómur Milljón: An ASR Corpus of One Million Verified Read Prompts in Icelandic

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/2pzie36f7qf7

Abstract

The platform samromur.is, or “Samrómur” for short, is a crowdsourcing web application built on Mozilla’s Common Voice, designed to accumulate speech data for the advancement of language technologies in Icelandic. Over the years, Samrómur has proven to be remarkably successful in amassing a significant number of high-quality audio clips from thousands of users. However, the challenge of manually verifying the entirety of the collected data has hindered its effective exploitation, especially in the realm of Automatic Speech Recognition (ASR), its original purpose. In this paper, we introduce the “Samrómur Milljón” corpus, an ASR dataset comprising one million audio clips from Samrómur. These clips have been automatically verified using state-of-the-art speech recognition systems such as NeMo, Wav2Vec2, and Whisper. Additionally, we present the ASR results obtained from creating acoustic models based on Samrómur Milljón. These results demonstrate significant promise when compared to other acoustic models trained with a similar volume of Icelandic data from different sources.

Details

Paper ID
lrec2024-main-1246
Pages
pp. 14305-14312
BibKey
hernandez-mena-etal-2024-samromur
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 to 25 May 2024

Authors

  • Carlos Daniel Hernandez Mena
  • Þorsteinn Daði Gunnarsson
  • Jon Gudnason

Links