
Common Voice for Pakistan: Developing an Open Speech Corpus for Low-Resource Pakistani Languages

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4r3mie85u8cq

Abstract

Pakistan is home to more than 70 languages, of which 30 are endangered. Most Pakistani languages remain absent from modern speech and text technologies, with resources focused on Urdu and a few major tongues. Supported by Mozilla’s Open Multilingual Speech Fund, this paper documents a one-year project to develop an open, community-driven speech corpus for 39 indigenous languages of Pakistan. The dataset includes locally authored texts, daily-life sentences, poetry, and folk songs to form a culturally balanced corpus. The project not only supports Automatic Speech Recognition but also promotes linguistic preservation and digital inclusion.

Details

Paper ID
lrec2026-main-265
Pages
pp. 3355-3359
BibKey
alam-etal-2026-common
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Meesum Alam

  • Francis Tyers

Links