
Bilingual Spoken Monologue Corpus for Simultaneous Machine Interpretation Research

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

DOI:10.63317/2airahzn8k72

Abstract

This paper describes a large-scale bilingual corpus of spoken monologues and their simultaneous interpretation, which has been constructed at CIAIR. The corpus has the following characteristics: (1) English and Japanese speeches are recorded in parallel, (2) the data contains monologue speeches such as lectures and self-introductions, and (3) the exact beginning and ending times are provided for each utterance. We have collected a total of about 70 hours of speech data and transcribed it into ASCII text files. The corpus will be made publicly available in the near future. This paper also provides an analysis of the professional interpreter's speeches using the bilingual corpus. The following points have been investigated: (1) the interpreting unit of simultaneous interpretation, (2) the difference between the beginning time of the lecturer's utterance and that of the interpreter's utterance, and (3) the interpreter's speaking speed. The characteristic features of the timing at which simultaneous interpreters start to speak are presented. The analysis will be useful for the development of a simultaneous machine interpreting system.

Details

Paper ID
lrec2002-main-273
Pages
N/A
BibKey
matsubara-etal-2002-bilingual
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Third International Conference on Language Resources and Evaluation
Location
Las Palmas, Spain
Date
29 May 2002 to 31 May 2002

Authors

  • Shigeki Matsubara

  • Akira Takagi

  • Nobuo Kawaguchi

  • Yasuyoshi Inagaki
