Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus

Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)

DOI:10.63317/354z5j8ik3p8

Abstract

This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus” and lays out the philosophy behind the high-granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset, we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator.

Details

Paper ID
lrec2000-main-164
Pages
N/A
BibKey
van-eynde-etal-2000-part
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
N/A
Conference
Second International Conference on Language Resources and Evaluation
Location
Athens, Greece
Date
31 May 2000 to 2 June 2000

Authors

  • Frank Van Eynde

  • Jakub Zavrel

  • Walter Daelemans