LREC 2026 Workshop

Bengali-English and Hindi-English Code Mixed Speech Data with Disfluencies

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

DOI:10.63317/254iufgehzzc

Abstract

Spontaneous speech in multilingual communities such as India frequently combines code-switching (CS) and disfluencies, yet existing Bengali–English and Hindi–English speech corpora consist largely of fluent or scripted utterances. This limits their suitability for developing and evaluating automatic speech recognition (ASR) systems intended for real conversational settings, particularly in micro-resource scenarios. We introduce BEHE-CMDisfl, a synthetic speech corpus that explicitly integrates disfluency phenomena into Bengali–English and Hindi–English code-mixed (CM) utterances. The textual content was generated using large language model (LLM) prompting strategies designed to encourage controlled switching and varied disfluency patterns, including filled pauses, repetitions, and restarts. The utterances were then synthesized using the Indic Parler text-to-speech (TTS) system. To demonstrate usability, we establish a reproducible GMM–HMM baseline for Bengali–English ASR using Kaldi on a 1.3-hour subset of the corpus. In our experiments, improvements came mainly from making the pronunciation lexicon consistent and applying phonetic normalization, with the best setup reaching a word error rate (WER) of 37.74%. An inspection of the decoded transcripts suggests that filled pauses and repetitions are not automatically collapsed but appear in the output, indicating that the disfluency cues present in the synthetic speech are captured during recognition.
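For readers unfamiliar with the metric, the sketch below shows how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. It is a generic illustration, not the authors' Kaldi scoring pipeline. The function name `word_error_rate` and the example sentence are assumptions invented here and are not taken from the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up romanized example (an assumption, not corpus data): the reference
# keeps a filled pause ("uh") and a repetition ("kal kal"); the hypothesis
# collapses both, which counts as two deletions.
reference = "ami uh kal kal office jabo"
hypothesis = "ami kal office jabo"
print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

Under this definition, a recognizer that drops filled pauses or repetitions is penalized whenever the reference transcript retains them, so whether disfluencies are kept in the reference directly affects the reported score.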

Details

Paper ID
lrec2026-ws-wildre-05
Pages
pp. 39-48
BibKey
mitra-etal-2026-bengali
Editors
Girish Nath Jha, Kalika Bali, Sobha L, Devendr Kumar
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Anuran Mitra
  • Tapabrata Mondal
  • Anirvan Chakravarty
  • Sivaji Bandyopadhyay

Links