Word Segmentation in the Spoken Dutch Corpus

Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)

DOI:10.63317/4mnid7cw7be5

Abstract

This paper describes the aims of the word segmentation in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) and the procedures used to create it. For one million words, a manually verified segmentation will be created, whereas the remaining nine million words will come only with an automatically generated segmentation. We describe our efforts to create the best possible automatic word segmentation from an auditorily verified phonetic transcription, and the development of a protocol for the manual verification of that automatic segmentation. The paper also reports some figures concerning the manual verification of the first hundred thousand words.

Details

Paper ID
lrec2002-main-097
Pages
N/A
BibKey
martens-etal-2002-word
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Third International Conference on Language Resources and Evaluation
Location
Las Palmas, Spain
Date
29 May 2002 to 31 May 2002

Authors

  • Jean-Pierre Martens

  • Diana Binnenpoorte

  • Kris Demuynck

  • Ruben Van Parys

  • Tom Laureys

  • Wim Goedertier

  • Jacques Duchateau

Links