Punctuation Prediction for Unsegmented Transcript Based on Word Vector

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Abstract

In this paper we propose an approach to predict punctuation marks for unsegmented speech transcripts. The approach is purely lexical, with pre-trained word vectors as the only input. A Deep Neural Network (DNN) or Convolutional Neural Network (CNN) model is trained to classify whether a punctuation mark should be inserted after the third word of a 5-word sequence and, if so, which kind of punctuation mark it should be. TED talks from the IWSLT dataset are used in both the training and evaluation phases. The proposed approach shows its effectiveness by achieving better results than the state-of-the-art lexical solution that works with the same type of data, especially when predicting punctuation position only.
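The 5-word framing described in the abstract can be illustrated with a minimal sketch. It only shows how a punctuated token stream might be turned into (window, label) pairs, where the label refers to the slot after the third word. The padding token, the label names, the set of punctuation marks, and the function name `build_examples` are assumptions made for illustration, not details taken from the paper. Word-vector lookup and the DNN/CNN classifier are left out.

```python
from typing import List, Tuple

# Hypothetical choices for illustration; the paper's label set and
# padding scheme may differ.
PAD = "<pad>"
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def build_examples(tokens: List[str]) -> List[Tuple[Tuple[str, ...], str]]:
    """Turn punctuated tokens into (5-word window, label) pairs.

    Each window is centred on one word (the third slot). The label says
    which punctuation mark, if any, follows that word ("O" means none).
    """
    words: List[str] = []
    labels: List[str] = []
    for tok in tokens:
        if tok in PUNCT:
            # Attach the mark to the preceding word, if there is one.
            if labels:
                labels[-1] = PUNCT[tok]
        else:
            words.append(tok)
            labels.append("O")

    # Pad two slots on each side so every word can sit in the third slot.
    padded = [PAD, PAD] + words + [PAD, PAD]
    return [(tuple(padded[i:i + 5]), labels[i]) for i in range(len(words))]


if __name__ == "__main__":
    for window, label in build_examples("so , what is this ?".split()):
        print(window, label)
```

In a full pipeline, each word in a window would presumably be replaced by its pre-trained vector before being passed to the classifier.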

Details

Paper ID

lrec2016-main-103

Pages

pp. 654-658

DOI

10.63317/39kpgdnbwmrw

BibKey

che-etal-2016-punctuation

Editors

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Publisher

European Language Resources Association (ELRA)

ISSN

2522-2686

ISBN

978-2-9517408-9-1

Conference

Tenth International Conference on Language Resources and Evaluation

Location

Portorož, Slovenia

Date

23 - 28 May 2016

Authors

Xiaoyin Che
Cheng Wang
Haojin Yang
Christoph Meinel
