The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

This paper presents a new Slovenian spoken language resource built from TEDx Talks. The speech database contains 242 talks with a total duration of 54 hours. The annotation and transcription of the acquired spoken material were generated automatically by applying acoustic segmentation and automatic speech recognition. The development and evaluation subset was also manually transcribed following the guidelines specified for the Slovenian GOS corpus. The manual transcriptions were used to evaluate the quality of the unsupervised transcriptions. The average word error rate for the SI TEDx-UM evaluation subset was 50.7%, with an out-of-vocabulary rate of 24% and a language model perplexity of 390. The unsupervised transcriptions contain 372k tokens, of which 32k are distinct.
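The two rate metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes the conventional definitions: word error rate as word-level edit distance divided by the number of reference words, and out-of-vocabulary rate as the share of running tokens missing from the recogniser vocabulary. The function names `word_error_rate` and `oov_rate` are hypothetical, and the abstract does not say which scoring tool or text normalisation the authors actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(tokens, vocabulary) -> float:
    """Share of running tokens absent from the vocabulary (assumed definition)."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("tokens must not be empty")
    vocab = set(vocabulary)
    return sum(1 for token in tokens if token not in vocab) / len(tokens)


# Toy data, not taken from the database.
print(word_error_rate("a b c d", "a x c"))            # 0.5
print(oov_rate(["a", "b", "c", "d"], {"a", "b", "c"}))  # 0.25
```

In the toy example the hypothesis has one substitution and one deletion against four reference words, giving 0.5; the reported 50.7% would correspond to roughly one error for every two reference words under the same definition.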

Details

Paper ID

lrec2016-main-740

Pages

pp. 4670-4673

DOI

10.63317/2iqucv4omxrw

BibKey

zgank-etal-2016-si

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23 - 28 May 2016

Authors

Andrej Žgank
Mirjam Sepesy Maučec
Darinka Verdonik
