Back to Main Conference 2006
LREC 2006main

A Corpus-based Approach to the Interpretation of Unknown Words with an Application to German

Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006)

DOI:10.63317/2k7u6dyj8iut

Abstract

Usually a high portion of the different word forms in a corpusreceive no reading by the lexical and/or morphological analysis.These unknown words constitute a huge problem for NLP analysis tasks likePOS-tagging or syntactic parsing. We present a parameterizable (in principle language-independent) corpus-basedapproach for the interpretation of unknown words that only needs a tokenizedcorpus and can be used in both offline and online applications. In combination with a few linguistic (language-dependent) rules unknown verbs, adjectives, nouns, multiword units etc. are identified.Depending on the recognized word class(es), more detailed morphosyntactic and semantic information is additionally identified in opposite to the majority ofother unknown word guessing methods,which only uses a very narrow decision window to assign an unknown wordits correct reading respective Part-of-Speech tag in a given text. We tested our approach by experiments with German data and received very promising results.

Details

Paper ID
lrec2006-main-330
Pages
N/A
BibKey
klatt-2006-corpus
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
2-9517408-2-4
Conference
Fifth International Conference on Language Resources and Evaluation
Location
Genoa, Italy
Date
24 May 2006 26 May 2006

Authors

  • SK

    Stefan Klatt

Links