LREC-COLING 2024 workshop

Towards Disfluency Annotated Corpora for Indian Languages

Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

DOI:10.63317/2rwi9knase3y

Abstract

In the natural course of spoken language, individuals often pause to think and correct themselves during speech production. These instances of interruption or correction are commonly referred to as disfluencies. When preparing data for downstream NLP tasks, these linguistic elements can be systematically removed, or handled as required, to enhance data quality. In this study, we present a comprehensive study of disfluencies in Indian languages. Our approach involves not only annotating real-world conversation transcripts but also conducting a detailed analysis of the linguistic nuances of Indian languages that must be considered during annotation. Additionally, we introduce a robust algorithm for the synthetic generation of disfluent data. This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages.

Details

Paper ID
lrec2024-ws-wildre-01
Pages
pp. 1-10
BibKey
kochar-etal-2024-towards
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
Date
20-25 May 2024

Authors

  • Chayan Kochar
  • Vandan Vasantlal Mujadia
  • Pruthwik Mishra
  • Dipti Misra Sharma

Links