Mining the Spoken Wikipedia for Speech Data and Beyond

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

We present a corpus of time-aligned spoken Wikipedia articles, together with a pipeline for generating such corpora in many languages. Initiatives to create and sustain spoken Wikipedia versions exist in many languages; the data is therefore freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h in sentences with some missing words. The English corpus consists of 287h, of which we align 27h in full sentences and 157h in sentences with some missing words. Results are publicly available.

Details

Paper ID

lrec2016-main-735

Pages

pp. 4644-4647

DOI

10.63317/5dnpnbxgjfh6

BibKey

kohn-etal-2016-mining

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23 - 28 May 2016

Authors

Arne Köhn
Florian Stegen
Timo Baumann
