Graph-Based Induction of Word Senses in Croatian

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

DOI:10.63317/3z6sy3u7ttp8

Abstract

Word sense induction (WSI) seeks to induce senses of words from unannotated corpora. In this paper, we address the WSI task for the Croatian language. We adopt the word clustering approach based on co-occurrence graphs, in which senses are taken to correspond to strongly interconnected components of co-occurring words. We experiment with a number of graph construction techniques and clustering algorithms, and evaluate the sense inventories both intrinsically, as a clustering problem, and extrinsically, on a word sense disambiguation (WSD) task. In the cluster-based evaluation, the Chinese Whispers algorithm outperformed Markov Clustering, yielding a normalized mutual information score of 64.3. In contrast, in the WSD evaluation, Markov Clustering performed better, yielding an accuracy of about 75%. We are making available two induced sense inventories of the 10,000 most frequent Croatian words: one coarse-grained and one fine-grained, both obtained using the Markov Clustering algorithm.
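
To make the graph-based setting concrete, the following is a minimal sketch of Chinese Whispers clustering over a toy co-occurrence graph. It is not the authors' implementation: the function name `chinese_whispers`, its parameters, the example context words for the ambiguous Croatian word "miš" (mouse), and all edge weights are assumptions made up for illustration, and the graph construction techniques evaluated in the paper are not reproduced here.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster the nodes of a weighted undirected graph.

    edges: iterable of (u, v, weight) tuples.
    Returns a dict mapping each node to a cluster label.
    """
    rng = random.Random(seed)
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w

    # Every node starts in its own cluster, labeled by its own name.
    labels = {node: node for node in neighbors}
    nodes = sorted(neighbors)

    for _ in range(iterations):
        rng.shuffle(nodes)  # visit nodes in random order each pass
        changed = False
        for node in nodes:
            # Sum the edge weights contributed by each neighboring label.
            scores = defaultdict(float)
            for nb, w in neighbors[node].items():
                scores[labels[nb]] += w
            # Adopt the strongest label; ties go to the smallest label.
            best = max(sorted(scores), key=scores.get)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # no label changed in a full pass
    return labels


# Hypothetical co-occurrence graph around "miš": an animal sense and a
# computer-device sense, joined by one weak spurious edge.
edges = [
    ("sir", "mačka", 3), ("sir", "rupa", 3), ("mačka", "rupa", 3),
    ("tipkovnica", "računalo", 3), ("tipkovnica", "ekran", 3),
    ("računalo", "ekran", 3),
    ("sir", "ekran", 1),
]
labels = chinese_whispers(edges)

# Group context words by induced sense cluster.
senses = defaultdict(list)
for word, label in labels.items():
    senses[label].append(word)
for members in senses.values():
    print(sorted(members))
```

In this toy graph, the two densely connected groups of context words end up in separate clusters, each standing for one induced sense. The weak cross-sense edge does not merge them, because every node has stronger links within its own group.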

Details

Paper ID
lrec2016-main-481
Pages
pp. 3014-3018
BibKey
bekavac-snajder-2016-graph
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-9-1
Conference
Tenth International Conference on Language Resources and Evaluation
Location
Portorož, Slovenia
Date
23-28 May 2016

Authors

  • Marko Bekavac

  • Jan Šnajder

Links