LREC 2022main

Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/2owz525espy7

Abstract

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification, and text-to-speech tools. To collect the corpus sustainably, we employ a crowdsourcing methodology. Huqariq includes four native languages of Peru, and it is expected to reach up to 20 of Peru's 48 native languages by 2022. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. To verify the quality of the corpus, we present speech recognition experiments using the 220 hours of fully transcribed audio.

Details

Paper ID
lrec2022-main-537
Pages
pp. 5029-5034
BibKey
zevallos-etal-2022-huqariq
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Rodolfo Zevallos

  • Luis Camacho

  • Nelsi Melgarejo

Links