Word Segmentation in the Spoken Dutch Corpus

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

Abstract

This paper describes the aims of the word segmentation in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) and the procedures used to create it. For one million words, a manually verified segmentation will be created, whereas the remaining nine million words will come only with an automatically generated segmentation. We describe our efforts to create the best possible automatic word segmentation from an auditorily verified phonetic transcription, and the development of a protocol for the manual verification of that automatic segmentation. The paper also reports some figures concerning the manual verification of the first hundred thousand words.