LREC 2016 (Main Conference)

An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/2a393no6zyj3

Abstract

In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. The scheme first relies on automatic transcription of the broadcast news (BN) data by our Slovak large vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed of 25 hours of fully annotated spontaneous and prepared speech. In addition, we have acquired a further 900 hours of BN data, part of which we plan to annotate semi-automatically. We present a preliminary corpus evaluation that gives very promising results.

Details

Paper ID
lrec2016-main-743
Pages
pp. 4684-4687
BibKey
viszlay-etal-2016-extension
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23-28 May 2016

Authors

  • Peter Viszlay
  • Ján Staš
  • Tomáš Koctúr
  • Martin Lojka
  • Jozef Juhár

Links