Back to Main Conference 2024
LREC-COLING 2024main

Audiocite.net : A Large Spoken Read Dataset in French

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5fs6bdnemn3g

Abstract

The advent of self-supervised learning (SSL) in speech processing has allowed the use of large unlabeled datasets to learn pre-trained models, serving as powerful encoders for various downstream tasks. However, the application of these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net corpus composed of 6,682 hours of recordings from 130 readers. This corpus is composed of audiobooks from the audiocite.net website, shared by 130 readers. In addition to describing the creation process and final statistics, we also show how this dataset impacted the models of LeBenchmark project in its 14k version for speech processing downstream tasks.

Details

Paper ID
lrec2024-main-0159
Pages
pp. 1795-1800
BibKey
felice-etal-2024-audiocite
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • SF

    Soline Felice

  • SE

    Solene Virginie Evain

  • SR

    Solange Rossato

  • FP

    François Portet

Links