LREC 2026 workshop

Automatic Lemmatisation for Norwegian

Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

DOI:10.63317/2cpfp5inka2c

Abstract

We report on a new lemmatisation system for Norwegian, a particularly challenging language because its two written standards, Bokmål and Nynorsk, both allow a great deal of optional variation. Our system covers both standards and has two components: a neural model that classifies each word into a rewrite-rule class that produces its lemma, and a large-scale computational lexicon that lists all possible inflections for a large part of the Norwegian vocabulary. We test different ways of combining these components. When evaluated by pure string matching against the gold lemmas, all systems perform at approximately the same level (99.1-99.2% on Bokmål and 98.5-98.6% on Nynorsk). A detailed error analysis, however, separates true errors from "surface errors", such as choosing a different but equally acceptable spelling variant of the correct lemma, and shows that the computational lexicon reduces the number of true errors by more than half (reaching 99.6% accuracy on Bokmål and 99.3% on Nynorsk).
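To make the abstract's ideas concrete, here is a minimal Python sketch of lemmatisation by rewrite rule, a lexicon check, and the split between surface errors and true errors. Everything in it is an illustrative assumption and not the authors' implementation: the `SuffixRule` format, the toy lexicon, the `lemmatise` combination strategy, and the variant table are hypothetical. The abstract says only that several ways of combining the components were tested. The predicted rule is passed in by hand where the real system would use the neural classifier.

```python
from typing import Dict, NamedTuple, Optional, Set


class SuffixRule(NamedTuple):
    """Hypothetical rewrite rule: strip a suffix, then append a new one."""
    strip: str
    add: str


def apply_rule(form: str, rule: SuffixRule) -> Optional[str]:
    """Apply the rule to a word form; return None if the suffix does not match."""
    if not form.endswith(rule.strip):
        return None
    stem = form[: len(form) - len(rule.strip)]
    return stem + rule.add


# Toy stand-in for a full-form lexicon: inflected form -> possible lemmas.
TOY_LEXICON: Dict[str, Set[str]] = {
    "bøkene": {"bok"},
    "kastet": {"kaste"},
}


def lemmatise(form: str, predicted_rule: SuffixRule,
              lexicon: Dict[str, Set[str]]) -> str:
    """One assumed combination: trust the rule unless the lexicon disagrees."""
    candidate = apply_rule(form, predicted_rule)
    known = lexicon.get(form)
    if known is None:
        # Form not covered by the lexicon: fall back to the rule output.
        return candidate if candidate is not None else form
    if candidate in known:
        return candidate
    # Lexicon overrides the rule; pick deterministically among its lemmas.
    return sorted(known)[0]


def classify_prediction(predicted: str, gold: str,
                        variants: Dict[str, Set[str]]) -> str:
    """Separate exact matches, acceptable spelling variants, and true errors."""
    if predicted == gold:
        return "correct"
    if predicted in variants.get(gold, set()):
        return "surface error"
    return "true error"
```

For example, the rule `SuffixRule("ene", "")` turns `hestene` into `hest`, but applied to `bøkene` it yields the wrong stem `bøk`, and the lexicon entry corrects it to `bok`. With a variant table such as `{"frem": {"fram"}}`, predicting `fram` against gold `frem` counts as a surface error, not a true error. Strict string matching would count both kinds as wrong.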

Details

Paper ID
lrec2026-ws-slide-08
Pages
pp. 93-103
BibKey
yildirim-etal-2026-automatic
Editors
Erhard Hinrichs (Tübingen University, Germany), Joakim Nivre (Uppsala University, Sweden), Petya Osenova (Sofia University, Bulgaria), James Pustejovsky (Brandeis University, USA), Claus Zinn (Tübingen University, Germany)
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Ahmet Yildirim

  • Kristin Hagen

  • Dag Haug

Links