A New Methodology for Speech Corpora Definition from Internet Documents

Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)

DOI:10.63317/45t8uxz4apbx

Abstract

This paper describes a new methodology for defining speech corpora from Internet documents, with the aim of recording a large speech database dedicated to training and testing acoustic models for speech recognition. The first section presents the Web robot in charge of collecting Web pages from the Internet, then explains the mechanism that filters Web text into French sentences. Some information is given about the corpus organization (90% for training and 10% for testing). The third section presents the phoneme distribution of the corpus and compares it with other studies of the French language. Finally, the tools and plan for recording the speech database with more than one hundred speakers are described.

Details

Paper ID
lrec2000-main-178
Pages
N/A
BibKey
vaufreydaz-etal-2000-new
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Second International Conference on Language Resources and Evaluation
Location
Athens, Greece
Date
31 May 2000 - 2 June 2000

Authors

  • D. Vaufreydaz

  • C. Bergamini

  • J.F. Serignat

  • L. Besacier

  • M. Akbar
