LREC 2020 workshop

Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines

Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

DOI:10.63317/3kycfgohqbw6

Abstract

In this work, we focus on improving ASR output segmentation for speech-to-text translation of low-resource languages. ASR output segmentation is crucial because ASR systems segment the input audio using purely acoustic information and are not guaranteed to output sentence-like segments. Since most MT systems expect sentences as input, feeding in longer unsegmented passages can lead to sub-optimal performance. We explore the feasibility of using datasets of subtitles from TV shows and movies to train better ASR segmentation models. We further incorporate part-of-speech (POS) tag and dependency label information (derived from the unsegmented ASR outputs) into our segmentation model, and we show that this noisy syntactic information can improve model accuracy. We evaluate our models intrinsically on segmentation quality and extrinsically on downstream MT performance, as well as on other downstream tasks, including cross-lingual information retrieval (CLIR) and human relevance assessments. Our model shows improved performance on downstream tasks for Lithuanian and Bulgarian.
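As a rough illustration of the idea in the abstract, the sketch below shows one way subtitle lines could be turned into token-level boundary labels for training a segmenter, and how predicted boundaries could be scored intrinsically with F1. This is not the authors' implementation: the function names `subtitles_to_examples` and `boundary_f1`, the assumption that each subtitle line ends at a sentence-like boundary, and the lowercasing and punctuation stripping used to imitate raw ASR output are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def subtitles_to_examples(lines: List[str]) -> Tuple[List[str], List[int]]:
    """Concatenate subtitle lines into one token stream with boundary labels.

    Label 1 marks a token that ends a subtitle line (assumed here to be a
    sentence-like boundary); label 0 marks every other token.
    """
    tokens: List[str] = []
    labels: List[int] = []
    for line in lines:
        # Lowercase and strip punctuation so the text resembles raw ASR output
        # (an assumption of this sketch, not a step stated in the abstract).
        words = [w.strip(".,!?;:\"'") for w in line.lower().split()]
        words = [w for w in words if w]
        if not words:
            continue
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def boundary_f1(gold: List[int], pred: List[int]) -> float:
    """F1 over boundary positions, one possible intrinsic segmentation score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A segmentation model would then be trained to predict the label sequence from the token sequence; per the abstract, POS tags and dependency labels derived from the unsegmented ASR output can be supplied as additional per-token features.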

Details

Paper ID
lrec2020-ws-clssts-11
Pages
pp. 68-73
BibKey
wan-etal-2020-subtitles
Publisher
European Language Resources Association (ELRA)
Workshop
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)
Date
11 May 2020 to 16 May 2020

Authors

  • David Wan
  • Zhengping Jiang
  • Chris Kedzie
  • Elsbeth Turcan
  • Peter Bell
  • Kathy McKeown

Links