Back to Main Conference 2024
LREC-COLING 2024main

TARIC-SLU: A Tunisian Benchmark Dataset for Spoken Language Understanding

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/4evgpxxgwme5

Abstract

In recent years, there has been a significant increase in interest in developing Spoken Language Understanding (SLU) systems. SLU involves extracting a list of semantic information from the speech signal. A major issue for SLU systems is the lack of sufficient amount of bi-modal (audio and textual semantic annotation) training data. Existing SLU resources are mainly available in high-resource languages such as English, Mandarin and French. However, one of the current challenges concerning low-resourced languages is data collection and annotation. In this work, we present a new freely available corpus, named TARIC-SLU, composed of railway transport conversations in Tunisian dialect that is continuously annotated in dialogue acts and slots. We describe the semantic model of the dataset, the data and experiments conducted to build ASR-based and SLU-based baseline models. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and will be integrated to SpeechBrain, a popular open-source conversational AI toolkit based on PyTorch.

Details

Paper ID
lrec2024-main-1357
Pages
pp. 15606-15616
BibKey
mdhaffar-etal-2024-taric
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • SM

    Salima Mdhaffar

  • FB

    Fethi Bougares

  • Rd

    Renato de Mori

  • SZ

    Salah Zaiem

  • MR

    Mirco Ravanelli

  • YE

    Yannick Estève

Links