SHONGLAP: A Large Bengali Open-Domain Dialogue Corpus

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

Abstract

We introduce SHONGLAP, a large annotated open-domain dialogue corpus in the Bengali language. Because high-quality dialogue datasets are unavailable for low-resource languages like Bengali, existing neural open-domain dialogue systems suffer from data scarcity. We propose a framework for preparing large-scale open-domain dialogue datasets from publicly available multi-party discussion podcasts and talk shows, and for labeling them with weak-supervision techniques, an approach that is particularly suitable for low-resource settings. Using this framework, we prepared our corpus, the first reported Bengali open-domain dialogue corpus (7.7k+ fully annotated dialogues in total), which can serve as a strong baseline for future work. Experimental results show that fine-tuning on our corpus improves the performance of large language models (BanglaBERT) on downstream classification tasks.

Details

Paper ID

lrec2022-main-623

Pages

pp. 5797-5804

DOI

10.63317/3os45jxx3wi7

BibKey

monsur-etal-2022-shonglap

Editors

Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

79-10-95546-38-2

Conference

Thirteenth Language Resources and Evaluation Conference

Location

Marseille, France

Date

20 - 25 June 2022

Authors

Syed Mostofa Monsur
Sakib Chowdhury
Md Shahrar Fatemi
Shafayat Ahmed