Konkani Daan: A Community-Driven Culturally Grounded Speech Corpus for Low-Resource ASR

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

Abstract

Indian languages are deeply embedded in cultural traditions, oral narratives, regional lexicons, and socially grounded communicative practices. However, existing speech resources and large multilingual ASR models often underrepresent culturally rich and naturally occurring speech varieties. In this paper, we introduce Konkani Daan, a community-driven initiative for collecting culturally grounded speech data for the Konkani language. The corpus currently comprises over 43.9 hours of 16 kHz speech recordings contributed through a web-based participatory platform. We evaluate a strong multilingual baseline, AI4Bharat IndicConformer-600M, in zero-shot mode on the Konkani Daan development set (379 utterances), where it obtains a Word Error Rate (WER) of 46.46% and a Character Error Rate (CER) of 15.47%, indicating substantial domain and cultural mismatch. Through qualitative error analysis, we identify systematic challenges including compound word segmentation, numeric normalisation, named entity distortion, and orthographic variation. Our findings demonstrate that culturally dense community speech exposes systematic limitations in multilingual ASR systems, and they motivate normalisation-aware and culturally informed evaluation strategies.
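For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how WER and CER are conventionally computed from edit distance. It is not the scoring script used for the reported results. The function names are hypothetical, the English strings are placeholders rather than corpus data, and the CER here counts whitespace and Unicode code points, which is an assumption and may differ from the evaluation setup.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate over code points, spaces included (assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder example of a compound segmentation mismatch (not corpus data).
ref = "ice cream shop"
hyp = "icecream shop"
print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

In this placeholder pair a single missing space costs two word-level edits out of three reference words but only one character-level edit out of fourteen, which illustrates how segmentation differences can produce a WER far higher than the CER.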