Back to Main Conference 2014
LREC 2014main

DerivBase.hr: A High-Coverage Derivational Morphology Resource for Croatian

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/28u4errifyme

Abstract

Knowledge about derivational morphology has been proven useful for a number of natural language processing (NLP) tasks. We describe the construction and evaluation of DerivBase.hr, a large-coverage morphological resource for Croatian. DerivBase.hr groups 100k lemmas from web corpus hrWaC into 56k clusters of derivationally related lemmas, so-called derivational families. We focus on suffixal derivation between and within nouns, verbs, and adjectives. We propose two approaches: an unsupervised approach and a knowledge-based approach based on a hand-crafted morphology model but without using any additional lexico-semantic resources The resource acquisition procedure consists of three steps: corpus preprocessing, acquisition of an inflectional lexicon, and the induction of derivational families. We describe an evaluation methodology based on manually constructed derivational families from which we sample and annotate pairs of lemmas. We evaluate DerivBase.hr on the so-obtained sample, and show that the knowledge-based version attains good clustering quality of 81.2% precision, 76.5% recall, and 78.8% F1 -score. As with similar resources for other languages, we expect DerivBase.hr to be useful for a number of NLP tasks.

Details

Paper ID
lrec2014-main-068
Pages
pp. 3371-3377
BibKey
snajder-2014-derivbase
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 31 May 2014

Authors

  • Jan Šnajder

Links