Bootstrapping a Database of German Multi-word Expressions

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/2tgbmuu47jbw

Abstract

We pre-classified 32,000 entries from the Wörterbuch der deutschen Idiomatik (Schemann 1993) using an inductive description of POS sequences in conjunction with a Brill Tagger trained on manually tagged idiomatic entries. This process assigned categories to 86% of entries with 88% accuracy. Further manual classification resulted in a database of multi-word expressions where each entry is associated with a sequence of POS-tag/token pairs. The second phase of our project, currently underway, addresses the association of a sequence of POS-tag/token pairs with a corpus example. To this end, we generate a weighted finite state transducer from the sequences for each entry and apply a finite state filter to the corpus. The filter will extract those sequences in the corpus that correspond to the longest match of the multi-word expression.
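
The longest-match idea in the abstract can be illustrated with a much simplified sketch. It is not the weighted finite state transducer the paper describes: it only scans a POS-tagged sentence and, at each position, keeps the longest entry whose POS-tag/token sequence matches. The STTS-style tags, the two entries, and the names `matches_at` and `longest_matches` are assumptions made for illustration and are not taken from the database itself.

```python
from typing import Dict, List, Optional, Tuple

Tagged = Tuple[str, str]                    # (POS tag, token) from a tagged corpus
Pattern = List[Tuple[str, Optional[str]]]   # token None = any token with that tag


def matches_at(pattern: Pattern, sentence: List[Tagged], start: int) -> bool:
    """Return True if the pattern matches the sentence beginning at index start."""
    if not pattern or start + len(pattern) > len(sentence):
        return False
    for (ptag, ptok), (stag, stok) in zip(pattern, sentence[start:]):
        if ptag != stag:
            return False
        if ptok is not None and ptok.lower() != stok.lower():
            return False
    return True


def longest_matches(patterns: Dict[str, Pattern],
                    sentence: List[Tagged]) -> List[Tuple[str, int, int]]:
    """Scan left to right; at each position keep only the longest matching entry."""
    results = []
    i = 0
    while i < len(sentence):
        best = None
        for name, pattern in patterns.items():
            if matches_at(pattern, sentence, i):
                if best is None or len(pattern) > len(best[1]):
                    best = (name, pattern)
        if best is not None:
            results.append((best[0], i, i + len(best[1])))
            i += len(best[1])   # skip past the matched span
        else:
            i += 1
    return results


# Hypothetical entries: the shorter one is a prefix of the longer one.
PATTERNS: Dict[str, Pattern] = {
    "ins Gras": [("APPRART", "ins"), ("NN", "Gras")],
    "ins Gras beißen": [("APPRART", "ins"), ("NN", "Gras"), ("VVINF", "beißen")],
}

# Hypothetical tagged sentence: "Er musste ins Gras beißen ."
SENTENCE: List[Tagged] = [
    ("PPER", "Er"), ("VMFIN", "musste"), ("APPRART", "ins"),
    ("NN", "Gras"), ("VVINF", "beißen"), ("$.", "."),
]

MATCHES = longest_matches(PATTERNS, SENTENCE)
```

Here both entries match at token index 2, and only the three-token entry is kept. A real finite state filter would additionally have to handle inflected forms, intervening material, and weights, none of which this sketch attempts.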

Details

Paper ID
lrec2004-main-371
Pages
N/A
BibKey
geyken-2004-bootstrapping
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 May 2004 to 28 May 2004

Authors

  • Alexander Geyken
