
Automated Morphological Segmentation and Evaluation

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)

DOI:10.63317/2au2may7cmt5

Abstract

In this paper we introduce (i) a new method for morphological segmentation of German words and (ii) some measures related to the minimum description length (MDL) principle for the evaluation of morphological segmentations. Our segmentation method is based on general knowledge about inflection, derivation, and morphotactics, together with part-of-speech information, all of which can be supplied with little effort. It is able to generate allomorphs, to deal with hierarchical structure, and to retrieve morphemes that do not occur in isolation in the input data. Manual evaluation of 1400 segmented types, counting omissions and false insertions of morpheme boundaries, gave 87% recall and 98% precision. To obtain automatic evaluation measures for morphological segmentations, we tested (i) vocabulary size and entropy measures (the data size aspect of the MDL principle), (ii) model size, represented as the number of states of reduced deterministic finite state automata (DFSAs) that match exactly the models' outputs, and (iii) a linear combination of (i) and (ii). These measures were applied to segmentations of differing quality. As a result, the linear combination of vocabulary size and the size of model-equivalent reduced DFSAs turned out to be an appropriate measure for ranking segmentation models according to their quality.
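As a rough illustration of two of the measures named in the abstract, the following Python sketch computes boundary-level precision and recall against a gold segmentation, and the vocabulary size and entropy of the resulting morph inventory. It is not the authors' implementation: the function names, the micro-averaging over words, the unigram entropy, and the toy German segmentations are all assumptions made for this example. The DFSA-based model size and the weights of the linear combination are not specified in the abstract and are left out.

```python
import math
from collections import Counter


def boundaries(segmentation):
    """Character offsets of the morpheme boundaries inside one word."""
    positions, offset = set(), 0
    for morph in segmentation[:-1]:  # no boundary after the last morph
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_precision_recall(predicted, gold):
    """Micro-averaged boundary precision and recall (assumed averaging scheme)."""
    hits = n_pred = n_gold = 0
    for pred_seg, gold_seg in zip(predicted, gold):
        bp, bg = boundaries(pred_seg), boundaries(gold_seg)
        hits += len(bp & bg)   # correctly placed boundaries
        n_pred += len(bp)      # includes false insertions
        n_gold += len(bg)      # includes omissions
    precision = hits / n_pred if n_pred else 1.0
    recall = hits / n_gold if n_gold else 1.0
    return precision, recall


def vocabulary_size_and_entropy(segmentations):
    """Number of distinct morphs and unigram entropy (bits) of the morph tokens."""
    counts = Counter(m for seg in segmentations for m in seg)
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return len(counts), entropy


# Hypothetical toy data, for illustration only.
gold = [["Kind", "er", "garten"], ["un", "glaub", "lich"]]
predicted = [["Kinder", "garten"], ["un", "glaub", "lich"]]

precision, recall = boundary_precision_recall(predicted, gold)
print(precision, recall)  # 1.0 0.75
print(vocabulary_size_and_entropy(predicted))
```

In the toy data the predicted segmentation omits one boundary and inserts none, so precision stays at 1.0 while recall drops to 0.75, the same direction of asymmetry as the 98% precision and 87% recall reported above.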

Details

Paper ID
lrec2004-main-316
Pages
N/A
BibKey
reichel-weilhammer-2004-automated
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-1-6
Conference
Fourth International Conference on Language Resources and Evaluation
Location
Lisbon, Portugal
Date
26 May 2004 to 28 May 2004

Authors

  • Uwe D. Reichel
  • Karl Weilhammer
