Bazinga! A Dataset for Multi-Party Dialogues Structuring

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/5hw96y77itih

Abstract

We introduce Bazinga!, a dataset built around a large collection of TV and movie series, which are filled with challenging multi-party dialogues. Moreover, TV series come with very active fan bases, which facilitates the collection of metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide a baseline for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of these tasks and of transfer learning from models trained on mono-speaker audio or written text, which are more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a large part of Bazinga! is only partially annotated, we also expect this dataset to foster research on self-supervised and weakly supervised learning methods.

Details

Paper ID
lrec2022-main-367
Pages
pp. 3434-3441
BibKey
lerner-etal-2022-bazinga
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 to 25 June 2022

Authors

  • Paul Lerner
  • Juliette Bergoënd
  • Camille Guinaudeau
  • Hervé Bredin
  • Benjamin Maurice
  • Sharleyne Lefevre
  • Martin Bouteiller
  • Aman Berhe
  • Léo Galmant
  • Ruiqing Yin
  • Claude Barras

Links