Back to Main Conference 2024
LREC-COLING 2024main

Speech Recognition Corpus of the Khinalug Language for Documenting Endangered Languages

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/4evbjywgxqco

Abstract

Automatic Speech Recognition (ASR) can be a valuable tool to document endangered languages. However, building ASR tools for these languages poses several difficult research challenges, notably data scarcity. In this paper, we show the whole process of creating a useful ASR tool for language documentation scenarios. We publish the first speech corpus for Khinalug, an endangered language spoken in Northern Azerbaijan. The corpus consists of 2.67 hours of labeled data from recordings of spontaneous speech about various topics. As Khinalug is an extremely low-resource language, we investigate the benefits of multilingual models for self-supervised learning and supervised learning and achieve the performance of 6.65 Character Error Rate (CER) points and 25.53 Word Error Rate (WER) points. The benefits of multilingual models are further validated through experimentation with three additional under-resourced languages. Lastly, this work conducts quality assessments with linguists on new recordings to investigate the model’s usefulness in language documentation. We observe an evident degradation for new recordings, indicating the importance of enhancing model robustness. In addition, we find the inaudible content is the main cause of wrong ASR predictions, suggesting relating work on incorporating contextual information.

Details

Paper ID
lrec2024-main-1319
Pages
pp. 15171-15180
BibKey
li-etal-2024-speech
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • ZL

    Zhaolin Li

  • MR

    Monika Rind-Pawlowski

  • JN

    Jan Niehues

Links