Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)
Abstract
We describe a general two-stage procedure for re-using a custom corpus in spoken language system development: first, a transformation from character-based markup to XML; second, DSSSL stylesheet-driven enhancement of the XML markup with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).
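
The two stages can be illustrated with a minimal sketch. It is not the authors' implementation: it substitutes Python's standard xml.etree.ElementTree for the DSSSL stylesheets, and the input convention (whitespace-separated tokens, a leading '#' for non-speech events), the element names (utterance, w, orth, event) and the toy lexicon are all assumptions made for illustration. The layers parameter is one way to picture the parametrised ‘tagging on demand’ filter: only the requested lexical tag layers are attached.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: word form -> lexical tag layers (illustrative values only).
LEXICON = {
    "guten": {"pos": "ADJ", "lemma": "gut"},
    "tag": {"pos": "N", "lemma": "Tag"},
}


def to_xml(line):
    """Stage 1: convert a character-marked-up line into an XML utterance.

    Assumed convention: tokens are separated by whitespace, and a leading
    '#' marks a non-speech event such as '#noise'.
    """
    utt = ET.Element("utterance")
    for token in line.split():
        if token.startswith("#"):
            ET.SubElement(utt, "event", type=token[1:])
        else:
            w = ET.SubElement(utt, "w")
            ET.SubElement(w, "orth").text = token
    return utt


def enhance(utt, layers):
    """Stage 2: attach only the requested lexical tag layers to each word."""
    for w in utt.iter("w"):
        entry = LEXICON.get(w.findtext("orth").lower(), {})
        for layer in layers:
            if layer in entry:
                ET.SubElement(w, layer).text = entry[layer]
    return utt


if __name__ == "__main__":
    # Tagging on demand: request only the part-of-speech layer.
    tagged = enhance(to_xml("guten Tag #noise"), ["pos"])
    print(ET.tostring(tagged, encoding="unicode"))
```

Calling enhance with ["pos", "lemma"] instead would attach both layers to each word, which corresponds to generating a more fully tagged corpus rather than filtering on demand.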