Improving Slovene Language Models for Lexicographic Question Answering through Continued Pretraining and Instruction Fine-Tuning

Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

Abstract

This paper presents a two-stage training approach for improving the performance of Slovene large language models on lexicographic question-answering tasks. We built a lexical pretraining corpus of 356,294 Slovene word entries by converting structured data from multiple lexicographic sources into markdown format. We also created a question-answering dataset of 10,485 QA pairs from diverse sources, including automatically generated questions, a linguistic advisory portal, and community forums. Starting from the Slovenian GaMS model (based on Gemma 2 9B) and the GaMS 3 model (based on Gemma 3 12B), we performed continued pretraining on the lexical corpus, followed by instruction fine-tuning on our QA dataset combined with translated general-domain questions, and compared the results across model configurations. Our results demonstrate significant improvements in answering Slovene lexicographic questions (text similarity increasing from 0.226 to 0.542, with a BERTScore F1 of 0.915), validating the effectiveness of domain-specific continued pretraining for low-resource languages.
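As a rough illustration of the corpus-construction step described in the abstract, the sketch below renders one structured dictionary entry as a markdown document. The `LexicalEntry` fields, the `entry_to_markdown` helper, the heading layout, and the sample entry are all assumptions made for illustration; the abstract does not specify the actual schema or markdown template used by the authors.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """Hypothetical, simplified shape of one dictionary entry (not the authors' schema)."""
    headword: str
    part_of_speech: str
    senses: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)


def entry_to_markdown(entry: LexicalEntry) -> str:
    """Render one entry as a small markdown document suitable for a text corpus."""
    lines = [f"# {entry.headword}", "", f"**Part of speech:** {entry.part_of_speech}"]
    if entry.senses:
        # Numbered list of senses under its own subheading.
        lines += ["", "## Senses", ""]
        lines += [f"{i}. {sense}" for i, sense in enumerate(entry.senses, start=1)]
    if entry.examples:
        # Bulleted list of usage examples.
        lines += ["", "## Examples", ""]
        lines += [f"- {example}" for example in entry.examples]
    return "\n".join(lines) + "\n"


# Illustrative entry invented for this sketch, not taken from the paper's data.
entry = LexicalEntry(
    headword="hiša",
    part_of_speech="noun",
    senses=["a building for people to live in"],
    examples=["Kupili so novo hišo."],
)
markdown = entry_to_markdown(entry)
print(markdown)
```

Keeping one entry per markdown document, with optional sections omitted when empty, is one plausible way to merge sources that cover different fields for the same headword.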

Details

Paper ID

lrec2026-ws-slide-10

Pages

pp. 114-123

DOI

10.63317/2t3zgpnbv52e

BibKey

knez-etal-2026-improving

Editors

Erhard Hinrichs (Tübingen University, Germany), Joakim Nivre (Uppsala University, Sweden), Petya Osenova (Sofia University, Bulgaria), James Pustejovsky (Brandeis University, USA), Claus Zinn (Tübingen University, Germany)

Publisher

European Language Resources Association (ELRA)

ISSN

N/A

ISBN

N/A

Workshop

Proceedings of the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

Location

Palma, Mallorca, Spain

Date

11 - 16 May 2026

Authors

Timotej Knez
Slavko Zitnik
