From LLM Prompts to Acoustic Baselines: A Scalable Pipeline for Under-Resourced Disfluent Code-Mixed Speech

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

DOI:10.63317/5k9xcfdyrtev

Abstract

Spontaneous speech in multilingual communities often involves rapid code-mixing (CM) and natural disfluencies, yet such patterns are rarely reflected in available training data for under-resourced languages. This gap limits the development of robust automatic speech recognition (ASR) systems. To address this, we introduce BEHE-CMDisfl, a fully synthetic Bengali–English and Hindi–English disfluent code-mixed speech corpus generated through a controlled Large Language Model (LLM) and Text-to-Speech (TTS) pipeline. The dataset explicitly incorporates conversational phenomena such as filled pauses, repetitions, and restarts. We evaluate its usefulness under two ASR settings. In a micro-resource scenario (∼1.3 hours), a GMM-HMM Kaldi baseline achieved a 37.74% Word Error Rate (WER) after phonetic normalization to reduce transliteration inconsistencies, and successfully retained disfluency markers in decoding. We also examined adaptation of a modern foundation model. In zero-shot testing, openai/whisper-small failed on the code-mixed speech due to severe hallucinations and looping behavior. After applying parameter-efficient fine-tuning (LoRA) for 1,000 steps, the model stabilized, reduced insertion errors, captured rapid language switching more reliably, and achieved a WER of 21.37%. These findings show that synthetic data combined with efficient fine-tuning offers a practical path for ASR development in complex low-resource disfluent CM settings.
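Both reported results are Word Error Rates, so it may help to see how that metric is typically computed. The following is a minimal sketch, not the authors' evaluation code: it assumes whitespace tokenization and no text normalization, and the function name `word_error_rate` and the example sentences are hypothetical. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace and compared exactly;
    the paper's phonetic normalization step is NOT reproduced here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical romanized code-mixed utterance with a filled pause ("uh").
    reference = "uh main kal office ja raha tha"
    hypothesis = "main kal office ja raha tha"
    print(f"WER = {word_error_rate(reference, hypothesis):.4f}")
```

Because insertions count as errors, a hypothesis that loops or hallucinates extra words can push WER above 100%, which is why reduced insertion errors translate directly into a lower score. Dropping a filled pause, as in the example, counts as one deletion.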

Details

Paper ID
lrec2026-ws-sigul-22
Pages
pp. 222-232
BibKey
mitra-etal-2026-llm
Editors
Atul Kr. Ojha, Sakriani Sakti, Claudia Soria, Maite Melero, John P. McCrae, Constantine Lignos, Chao-Hong Liu, German Rigau Claramunt, Georg Rehm
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Anuran Mitra

  • Anirvan Chakravarty

  • Tapabrata Mondal

  • Sivaji Bandyopadhyay