Bengali-English and Hindi-English Code Mixed Speech Data with Disfluencies

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

Abstract

Spontaneous speech in multilingual communities such as India frequently combines code-switching (CS) and disfluencies, yet existing Bengali–English and Hindi–English speech corpora largely consist of fluent or scripted utterances. This limits their suitability for developing and evaluating automatic speech recognition (ASR) systems intended for real conversational settings, particularly in micro-resource scenarios. We introduce BEHE-CMDisfl, a synthetic speech corpus that explicitly integrates disfluency phenomena within Bengali–English and Hindi–English code-mixed (CM) utterances. The textual content was generated by prompting large language models (LLMs) to produce controlled switching and varied disfluency patterns, including filled pauses, repetitions, and restarts. The utterances were subsequently synthesized using the Indic Parler text-to-speech (TTS) system. To demonstrate usability, we establish a reproducible GMM–HMM baseline for Bengali–English ASR using Kaldi on a 1.3-hour subset of the corpus. In our experiments, improvements came mainly from making the pronunciation lexicon consistent and applying phonetic normalization, with the best setup reaching a word error rate (WER) of 37.74%. A closer look at the decoded transcripts suggests that filled pauses and repetitions are not collapsed during decoding but appear in the output, indicating that the disfluency cues present in the synthetic speech are captured during recognition.
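For readers unfamiliar with the metric, the sketch below illustrates how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. It is a minimal illustration, not the scoring pipeline used for the reported baseline; the function name `word_error_rate` and the romanized example utterance are hypothetical and do not come from the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-mixed utterance with a filled pause ("um") and a repetition.
reference = "ami um kal kal office jabo"
hypothesis = "ami kal kal office jabo"  # the filled pause was dropped
wer = word_error_rate(reference, hypothesis)
```

Under this definition, a recognizer that silently drops a filled pause or collapses a repetition is charged a deletion, so retaining disfluencies in the output matters for WER whenever the reference transcripts keep them.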