Title

Word Segmentation in the Spoken Dutch Corpus

Authors

Jean-Pierre Martens (ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium)

Diana Binnenpoorte (Dept Language & Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands)

Kris Demuynck (ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium)

Ruben Van Parys (ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium)

Tom Laureys (ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium)

Wim Goedertier (ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium)

Jacques Duchateau (ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium)

Session

SO7: Tools For Spoken LRs

Abstract

This paper describes the aims of the word segmentation in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) and the procedures used to create it. For one million words, a manually verified segmentation will be created, whereas the remaining nine million words will only come with an automatically generated segmentation. We describe our efforts to create the best possible automatic word segmentation from an auditorily verified phonetic transcription, and the development of a protocol for the manual verification of that automatic segmentation. The paper also reports some figures concerning the manual verification of the first hundred thousand words.

Keywords

Word segmentation

Full Paper

97.pdf