
Slovene Morphological and Word Formation Segmentation: A Novel Dataset and Evaluation

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4f6rruft238c

Abstract

We introduce the first publicly available manually annotated dataset for morphological segmentation and word-formation analysis for Slovene, containing 1,935 words annotated by two domain experts. The dataset provides three types of linguistic information: morphological and word-formation segments with zero-morpheme and simplex annotations. We present a four-stage annotation approach achieving inter-annotator agreement of 86.80% Krippendorff’s Alpha for morphological segmentation and 85.16% for word-formation segments. Computational validation using a morphological segmentation model achieves 87.78% BPR F1 on morphological segmentation and 83.05% on word-formation segments. Despite being smaller than previous datasets derived from non-public resources, our dataset enables high performance and supports reproducible research for morphological analysis tools for Slovene.

Details

Paper ID
lrec2026-main-140
Pages
pp. 1781-1793
BibKey
pranji-etal-2026-slovene
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Marko Pranjić

  • Boris Kern

  • Ines Voršič

  • Senja Pollak

Links