
Konkani Daan: A Community-Driven Culturally Grounded Speech Corpus for Low-Resource ASR

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

DOI:10.63317/2auy49fgeg5o

Abstract

Indian languages are deeply embedded in cultural traditions, oral narratives, regional lexicons, and socially grounded communicative practices. However, existing speech resources and large multilingual ASR models often underrepresent culturally rich, naturally occurring speech varieties. In this paper, we introduce Konkani Daan, a community-driven initiative for collecting culturally grounded speech data for the Konkani language. The corpus currently comprises over 43.9 hours of 16 kHz speech recordings contributed through a web-based participatory platform. We evaluate a strong multilingual baseline, AI4Bharat IndicConformer-600M, in zero-shot mode on the Konkani Daan development set (379 utterances). It obtains a Word Error Rate (WER) of 46.46% and a Character Error Rate (CER) of 15.47%, which indicates substantial domain and cultural mismatch. Through qualitative error analysis, we identify systematic challenges including compound word segmentation, numeric normalisation, named entity distortion, and orthographic variation. Our findings show that culturally dense community speech exposes systematic limitations in multilingual ASR systems, and they motivate normalisation-aware and culturally informed evaluation strategies.
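To make the gap between the reported WER and CER more concrete, the following is a minimal sketch of how the two metrics are commonly computed from Levenshtein edit distance. It is not the authors' evaluation code: the function names are hypothetical, the sentences are an invented English stand-in rather than corpus data, and spaces are counted as characters in CER here, which the paper may or may not do.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions and deletions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length.

    Assumption: whitespace is counted as a character.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Toy illustration (invented, not from Konkani Daan): the hypothesis
# merges two reference words into one, as in a compound segmentation error.
reference = "the blue house is old"
hypothesis = "the bluehouse is old"

print(f"WER = {wer(reference, hypothesis):.2%}")
print(f"CER = {cer(reference, hypothesis):.2%}")
```

In this toy case a single missing space counts as two word-level errors (one substitution and one deletion) but only one character-level error. That is one way compound word segmentation can push WER well above CER.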

Details

Paper ID
lrec2026-ws-wildre-03
Pages
pp. 25-32
BibKey
shivolkar-etal-2026-konkani
Editors
Girish Nath Jha, Kalika Bali, Sobha L, Devendr Kumar
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Milind Shivolkar

  • Vaibhav Gawas

  • Jyoti Pawar
