Samrómur: Crowd-sourcing large amounts of data

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

Abstract

This contribution describes the collection of a large and diverse corpus for speech recognition and similar tools using crowd-sourced donations. We have built a collection platform inspired by Mozilla Common Voice and specialized it to our needs. We discuss the importance of engaging the community and motivating it to contribute, in our case through competitions. Given the incentive and a platform to easily read in large amounts of utterances, we have observed four cases of speakers freely donating over 10 thousand utterances. We have also seen that women are keener to participate in these events throughout all age groups. Manually verifying a large corpus is a monumental task and we attempt to automatically verify parts of the data using tools like Marosijo and the Montreal Forced Aligner. The method proved helpful, especially for detecting invalid utterances and halving the work needed from crowd-sourced verification.

Resources

Details

Paper ID

lrec2022-main-247

Pages

pp. 2311-2316

DOI

10.63317/4e5xf3c9yjb2

BibKey

hedstrom-etal-2022-samromur

Editors

Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis2020

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

79-10-95546-38-2

Conference

Thirteenth Language Resources and Evaluation Conference

Location

Marseille, France

Date

20 - 25 June 2022

Authors

SH
Staffan Hedström
DM
David Erik Mollberg
RÞ
Ragnheiður Þórhallsdóttir
JG
Jón Guðnason

Links

URL

DOI