The Swedish Benchmark of Linguistic Minimal Pairs

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

We introduce the Swedish Benchmark of Linguistic Minimal Pairs, a dataset for evaluating the syntactic abilities of language models. It comprises 2,500 minimal pairs covering 25 syntactic phenomena, with 100 pairs per phenomenon. Each pair contrasts a well-formed sentence with an ill-formed one that differs from it minimally. For each phenomenon, we constructed ten pairs manually from scratch, then generated the remaining 90 semi-automatically and adjusted them by hand. A random sample was assessed by 40 participants, who selected the well-formed sentence in 98.05% of cases. We evaluate eleven state-of-the-art models. The results show that models generally handle local agreement well but struggle with certain long-distance dependencies and word order phenomena. Model size seems to matter less than training domain, and prompt-based evaluation generally lowers performance. Model performance is stable across the handcrafted and generated subsets and across sample sizes, suggesting that 100 pairs per phenomenon suffice for reliable evaluation. Future work will expand the number of phenomena.
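
As a rough illustration of how a minimal-pair benchmark of this kind can be scored, the sketch below computes per-phenomenon accuracy. It assumes the common convention that a model is counted as correct on a pair when it assigns a higher score (for example, a sentence log-probability) to the well-formed sentence; the abstract does not specify the scoring procedure, so this is an assumption. The names `MinimalPair`, `accuracy_by_phenomenon`, and `toy_scores` are hypothetical, and the two Swedish pairs and their scores are made-up placeholders, not items or results from the benchmark.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class MinimalPair(NamedTuple):
    phenomenon: str
    good: str  # well-formed sentence
    bad: str   # ill-formed sentence


def accuracy_by_phenomenon(
    pairs: Iterable[MinimalPair],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Share of pairs per phenomenon where the well-formed sentence scores higher.

    Assumption: `score` returns a higher value for sentences the model
    prefers (e.g. a summed token log-probability). Ties count as errors.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        total[pair.phenomenon] += 1
        if score(pair.good) > score(pair.bad):
            correct[pair.phenomenon] += 1
    return {name: correct[name] / total[name] for name in total}


# Illustrative pairs only; these are NOT taken from the benchmark.
example_pairs = [
    MinimalPair("agreement", "Huset är stort.", "Huset är stor."),
    MinimalPair("word_order", "Jag läser inte boken.", "Jag inte läser boken."),
]

# Placeholder scores standing in for a real model; the values are arbitrary.
toy_scores = {
    "Huset är stort.": -10.0,
    "Huset är stor.": -12.0,
    "Jag läser inte boken.": -15.0,
    "Jag inte läser boken.": -14.0,
}

result = accuracy_by_phenomenon(example_pairs, toy_scores.__getitem__)
print(result)  # {'agreement': 1.0, 'word_order': 0.0}
```

In practice, `score` would wrap a language model, and each phenomenon would contribute 100 pairs rather than one, so each per-phenomenon accuracy would be a multiple of 0.01.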