LREC 2016 main

New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/3pnx6nwno2ic

Abstract

In this paper we present newly developed inflectional lexicons and manually annotated corpora of Croatian and Serbian. We introduce hrLex and srLex - two freely available inflectional lexicons of Croatian and Serbian - and describe the process of building these lexicons, supported by supervised machine learning techniques for lemma and paradigm prediction. Furthermore, we introduce hr500k, a manually annotated corpus of Croatian, 500 thousand tokens in size. We showcase the three newly developed resources on the task of morphosyntactic annotation of both languages by using a recently developed CRF tagger. We achieve the best results yet reported on the task for both languages, beating the HunPos baseline trained on the same datasets by a wide margin.

Details

Paper ID
lrec2016-main-676
Pages
pp. 4264-4270
BibKey
ljubesic-etal-2016-new
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23-28 May 2016

Authors

  • Nikola Ljubešić
  • Filip Klubička
  • Željko Agić
  • Ivo-Pavao Jazbec

Links