Orthographic and Phonetic Annotation of Very Large Czech Corpora with Quality Assessment
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)
Abstract
Annotation is generally an inseparable part of a speech database. In this paper we present combined orthographic and phonetic annotation of large Czech databases. Phonetic annotation can be very important, because it gives more information than a pronunciation lexicon with possible pronunciation variants. Moreover, for Czech, phonetic annotation requires only a small additional effort on top of standard orthographic transcription. The tool FTP-Trascriber, developed for this purpose, is also presented. In the second part we present the quality assessment procedure applied to the annotation of large speech corpora collected at our laboratories. We describe semi-automated quality checks that rely on several fully automated pre-checks to reduce the additional manual effort required.