Acquisition and Annotation of Slovenian Broadcast News Database
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)
Abstract
This paper presents the Slovenian Broadcast News Database project, which was started in 2002 as a cooperation between the University of Maribor and the Slovenian national broadcaster RTV Slovenia. The resulting database will be used for large vocabulary continuous speech recognition and for multimedia database retrieval or archive indexing. First, some organizational aspects that had to be addressed in the initial phase of the project are described. The raw audio and video material was acquired from the original analog Beta SP master tapes preserved in RTV Slovenia's archive and was copied to DAT and DVD media. Additional teletext material was also collected. The speech material is annotated manually with the Transcriber tool. The annotation rules were defined on the basis of general rules for Broadcast News databases, with some special language-dependent sections. Some statistics on part of the current material are also given.