Bazinga! A Dataset for Multi-Party Dialogues Structuring

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

Abstract

We introduce a dataset built around a large collection of TV (and movie) series. These series are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base that allows the collection of metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide a baseline for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of the tasks and of transfer learning from models trained on mono-speaker audio or written text, which is more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a large part of Bazinga! is only partially annotated, we also expect this dataset to foster research towards self- or weakly-supervised learning methods.

Details

Paper ID

lrec2022-main-367

Pages

pp. 3434-3441

DOI

10.63317/5hw96y77itih

BibKey

lerner-etal-2022-bazinga

Editors

Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

979-10-95546-38-2

Conference

Thirteenth Language Resources and Evaluation Conference

Location

Marseille, France

Date

20 - 25 June 2022

Authors

Paul Lerner
Juliette Bergoënd
Camille Guinaudeau
Hervé Bredin
Benjamin Maurice
Sharleyne Lefevre
Martin Bouteiller
Aman Berhe
Léo Galmant
Ruiqing Yin
Claude Barras
