Multi-SimLex for Dutch: Benchmarking Embedding- and Prompt-Based Model Performance on Semantic Similarity
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We introduce Dutch Multi-SimLex, a 1,888-pair extension of the Multi-SimLex benchmark for evaluating lexical semantic similarity in Dutch. The dataset was rated by 100 native speakers on a 0–6 scale and shows high reliability (overall ICC(2,k)=0.82) as well as strong alignment with English (ρ=0.73). Using this resource, we evaluate eighteen models spanning four architectural families: static embeddings, encoder-only transformers, encoder–decoders, and decoder-only LLMs. Each model is tested under two complementary protocols: embedding-based cosine similarity and prompted similarity judgments in Dutch. In the embedding-based setting, FastText (ρ=0.485) and the monolingual Dutch encoder BERTje (ρ=0.468) align most closely with human ratings, while multilingual encoders such as mBERT (ρ=0.208) and XLM-R (ρ=0.186) lag well behind. Prompt-based evaluation yields substantially higher correlations, with GPT-4 (ρ=0.761) performing best, followed by DeepSeek-V3 (ρ=0.753) and Gemini 1.5 Pro (ρ=0.722). Together, these results show that measured model performance depends strongly on how meaning is probed. Dutch Multi-SimLex thus provides a reliable foundation for evaluating lexical semantics across architectures and for advancing Dutch semantic evaluation.
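In the embedding-based protocol, evaluation reduces to scoring each word pair by the cosine similarity of its two vectors and correlating those scores with the human ratings via Spearman's ρ. A minimal sketch of this pipeline, using hypothetical toy vectors and ratings in place of the real Dutch embeddings and Multi-SimLex annotations:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for real Dutch word embeddings
# (hypothetical values, for illustration only).
vectors = {
    "hond":  np.array([0.9, 0.1, 0.0]),
    "kat":   np.array([0.8, 0.2, 0.1]),
    "auto":  np.array([0.0, 0.9, 0.4]),
    "fiets": np.array([0.1, 0.8, 0.6]),
}

# (word1, word2, invented human rating on the 0-6 Multi-SimLex scale)
pairs = [("hond", "kat", 5.5), ("auto", "fiets", 4.0), ("hond", "auto", 0.5)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

# Spearman's rho compares the two rankings; it is the metric used
# to report all correlations in the abstract.
rho, _ = spearmanr(model_scores, human_scores)
```

The same ρ computation applies to the prompt-based protocol; only `model_scores` changes, from cosine values to similarity judgments elicited from the model with a Dutch prompt.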