Common Voice for Pakistan: Developing an Open Speech Corpus for Low-Resource Pakistani Languages
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Pakistan is home to more than 70 languages, 30 of which are endangered. Most Pakistani languages remain absent from modern speech and text technologies, with resources concentrated on Urdu and a few other major languages. This paper documents a one-year project, carried out through Mozilla’s Open Multilingual Speech Fund, to develop an open, community-driven speech corpus for 39 indigenous languages of Pakistan. The dataset includes locally authored texts, daily-life sentences, poetry, and folk songs to make it culturally balanced. The project not only supports Automatic Speech Recognition but also promotes linguistic preservation and digital inclusion.