Linguistic Resources for Effective, Affordable, Reusable Speech-to-Text
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)
Abstract
This paper describes ongoing efforts at Linguistic Data Consortium to create shared evaluation resources for improved speech-to-text technology. The DARPA EARS Program (Effective, Affordable, Reusable Speech-to-Text) is focused on enabling core STT technology to produce rich, highly accurate output in a range of languages and speaking styles. The aggressive EARS program goals motivate new approaches to corpus creation and distribution. EARS research sites require multilingual broadcast news and telephone speech, transcripts and annotations at a much higher volume than for any previous technology program. In response to these demands, LDC has developed new corpora for training and evaluating speech-to-text systems in English, Arabic and Chinese and to support systems that distinguish speakers, identify and repair disfluencies and punctuate a text to improve readability.