
Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/4q6csy438zgf

Abstract

In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available. We do not constrain the tagger by the lexicon entries, which allows for both incompleteness and noise in the lexicon. By using the lexicon indirectly through features, we allow known and unknown words to be tagged in the same manner. We test our tagger on Slovene data, obtaining a 25% error reduction over the best previous results on both known and unknown words. Given that Slovene is, in comparison to some other Slavic languages, a well-resourced language, we perform experiments on the impact of token (corpus) vs. type (lexicon) supervision, obtaining useful insights into how to balance the effort of extending resources to yield better tagging results.
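
The idea of using the lexicon through features rather than as a hard constraint can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function name `token_features`, the toy `LEXICON`, the feature names, and the tag strings (written in the general style of morphosyntactic descriptions) are all hypothetical. The sketch only shows that a word missing from the lexicon is handled by the same feature function as a known word; it simply receives no lexicon features.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: word form -> tags the form may carry.
# The tag strings are illustrative placeholders, not data from the paper.
LEXICON: Dict[str, Set[str]] = {
    "miza": {"Ncfsn"},
    "je": {"Va-r3s-n", "Pp3fsg--y"},
}


def token_features(tokens: List[str], i: int,
                   lexicon: Dict[str, Set[str]]) -> Dict[str, float]:
    """Build a feature dictionary for tokens[i].

    Lexicon entries are added as ordinary features, so a statistical
    tagger trained on these features may still predict a tag that the
    lexicon does not list (the lexicon informs, it does not constrain).
    """
    form = tokens[i].lower()
    feats: Dict[str, float] = {"bias": 1.0, "form=" + form: 1.0}

    # Suffix features help with words that are absent from the lexicon.
    for n in range(1, 5):
        if len(form) >= n:
            feats["suffix%d=%s" % (n, form[-n:])] = 1.0

    # One feature per tag hypothesis found in the lexicon.
    # Unknown words simply contribute no such features.
    for tag in sorted(lexicon.get(form, ())):
        feats["lex=" + tag] = 1.0

    return feats


if __name__ == "__main__":
    sentence = ["Miza", "je", "zelena"]
    for idx in range(len(sentence)):
        print(sentence[idx], sorted(token_features(sentence, idx, LEXICON)))
```

Feature dictionaries of this shape could be passed to any sequence classifier that accepts sparse features; which learner and which feature set the authors actually used is described in the paper itself, not in this sketch.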

Details

Paper ID
lrec2016-main-242
Pages
pp. 1527-1531
BibKey
ljubesic-erjavec-2016-corpus
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23 May 2016 to 28 May 2016

Authors

  • Nikola Ljubešić

  • Tomaž Erjavec
