ASR for Documenting Acutely Under-Resourced Indigenous Languages

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Abstract

Despite its potential utility for facilitating the transcription of speech recordings, automatic speech recognition (ASR) has not been widely explored as a tool for documenting endangered languages. One obstacle to adopting ASR for this purpose is that the amount of data needed to build a reliable ASR system far exceeds what would typically be available in an endangered language. Languages with highly complex morphology present further data sparsity challenges. In this paper, we present a working ASR system for Seneca, an endangered indigenous language of North America, as a case study for the development of ASR for acutely low-resource languages in need of linguistic documentation. We explore methods of leveraging linguistic knowledge to improve the ASR language models for a polysynthetic language with few high-quality audio and text resources, and we propose a tool for using ASR output to bootstrap new data to iteratively improve the acoustic model. This work serves as a proof-of-concept for speech researchers interested helping field linguists and indigenous language community members engaged in the documentation and revitalization of endangered languages.

Resources

Details

Paper ID

lrec2018-main-657

Pages

N/A

DOI

10.63317/3w22zkanvb4z

BibKey

jimerson-prudhommeaux-2018-asr

Editors

Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

79-10-95546-00-9

Conference

Eleventh International Conference on Language Resources and Evaluation

Location

Miyazaki, Japan

Date

7 - 12 May 2018

Authors

RJ
Robbie Jimerson
EP
Emily Prud’hommeaux

Links

URL

DOI