Methods and Tools for Speech Data Acquisition exploiting a Database of German Parliamentary Speeches and Transcripts from the Internet

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

Abstract

This paper describes methods that exploit stenographic transcripts of the German parliament to improve the acoustic models of a speech recognition system for this domain. Both the stenographic transcripts and the speech data are available on the Internet, which makes it possible to avoid the costly collection and annotation of a large amount of data. The automatic data acquisition technique uses the stenographic transcripts and the acoustic data of the German parliamentary speeches, together with general acoustic models trained on different data. The idea is to generate special finite state automata from the stenographic transcripts. These automata model the possible correspondences between the stenographic transcript and the actually spoken audio content, i.e. an accurate transcript. The first step is to recognize the speech data using a finite state automaton as the language model. The next step is to find, extract and verify the matches between sections of recognized words and the actually spoken audio content. The automatically extracted and verified data can then be used for acoustic model training. Experiments show that, for a given recognition task from the German Parliament domain, the word error rate decreases by 20% absolute.
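
As a rough illustration of the matching step only, the following Python sketch aligns a stenographic transcript with a recognizer output and keeps the word runs on which both agree. It is not the method of the paper: the abstract does not specify the automaton construction or the verification criterion, so the sketch substitutes a generic sequence alignment (Python's standard `difflib.SequenceMatcher`) and a hypothetical minimum run length `min_run` as a stand-in for verification. The function name `extract_matched_segments` and the sample word sequences are likewise invented for the example.

```python
from difflib import SequenceMatcher


def extract_matched_segments(transcript_words, recognized_words, min_run=3):
    """Return runs of consecutive words shared by transcript and recognizer output.

    Each result is (start, end, words), where start/end index into
    recognized_words (end exclusive). Runs shorter than min_run are dropped;
    this threshold is an assumed stand-in for a real verification step.
    """
    # autojunk=False disables the popularity heuristic, so frequent words
    # are never silently ignored during alignment.
    matcher = SequenceMatcher(a=transcript_words, b=recognized_words,
                              autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            start, end = block.b, block.b + block.size
            segments.append((start, end, recognized_words[start:end]))
    return segments


# Invented sample data: the stenographic transcript omits a filler word and
# a repetition that the speaker actually produced.
transcript = "meine damen und herren ich danke ihnen sehr".split()
recognized = "ja meine damen und herren also ich danke ihnen ihnen sehr".split()

segments = extract_matched_segments(transcript, recognized)
for start, end, words in segments:
    print(start, end, " ".join(words))
```

With these inputs, the two agreeing runs are kept and the isolated final word is discarded because it is shorter than `min_run`. In a real pipeline the word indices would be mapped to time stamps from the recognizer so that the corresponding audio could be cut out for training.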