Bilingual Spoken Monologue Corpus for Simultaneous Machine Interpretation Research

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

Abstract

This paper describes a large-scale bilingual corpus of spoken monologues and their simultaneous interpretations, which has been constructed at CIAIR. The corpus has the following characteristics: (1) English and Japanese speech is recorded in parallel, (2) the data contains monologues such as lectures and self-introductions, and (3) the exact start and end times are provided for each utterance. We have collected about 70 hours of speech data in total and transcribed it into ASCII text files. The corpus will be made publicly available in the near future. This paper also provides an analysis of the professional interpreter's speech using the bilingual corpus. The following points have been investigated: (1) the interpreting unit of simultaneous interpretation, (2) the difference between the start time of the lecturer's utterance and that of the interpreter's utterance, and (3) the interpreter's speaking speed. The characteristic features of the timing at which simultaneous interpreters start to speak are presented. The analysis is expected to be useful for the development of a simultaneous machine interpreting system.
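
As an illustration of how the per-utterance timing information supports measurements (2) and (3), the following minimal Python sketch computes an onset lag and a speaking rate. The Utterance record, its field names, and the sample values are hypothetical and are not the corpus's actual file format; the words-per-second rate assumes whitespace-delimited text, so a Japanese transcript would need a different unit (for example, morae).

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical time-aligned utterance record (not the corpus format)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcription of the utterance


def onset_lag(source: Utterance, target: Utterance) -> float:
    """Seconds between the lecturer's onset and the interpreter's onset."""
    return target.start - source.start


def speaking_rate(utt: Utterance) -> float:
    """Words per second, assuming whitespace-delimited text."""
    duration = utt.end - utt.start
    if duration <= 0:
        raise ValueError("utterance must have a positive duration")
    return len(utt.text.split()) / duration


# Made-up example values, for illustration only.
lecturer = Utterance(start=0.0, end=4.0, text="Thank you all for coming today")
interpreter = Utterance(start=1.5, end=5.0, text="(interpreted utterance)")

print(onset_lag(lecturer, interpreter))  # interpreter starts 1.5 s after the lecturer
print(speaking_rate(lecturer))           # 6 words over 4 s
```

Which lecturer utterance is paired with which interpreter utterance depends on the interpreting unit chosen, so the pairing is left to the caller in this sketch.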