DAMETA: An LLM Benchmark for Danish Metaphor Interpretation with Systematically Varied Distractors
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present DAMETA, the first evaluation benchmark for Danish metaphor interpretation in language models, derived from three sources: an annotated corpus (the Dafig Corpus), the Danish dictionary (DDO), and culture reviews in Danish newspapers. Each of the 900 data instances contains a sentence with a metaphorical target word and four human-created paraphrase options, namely one correct interpretation and three systematically constructed distractors: i) a false literal paraphrase (typically concrete), ii) a false figurative paraphrase (typically abstract), and iii) a false contradictory paraphrase. The benchmark is evaluated on seven language models, and 5% of the data is additionally tested on human participants for comparison. Results show, among other things, that when the prompt states that the target word is a metaphor, the models tend to be most distracted by the false figurative paraphrase; when the prompt gives no such information, they are instead more distracted by the false literal paraphrase. The dataset goes beyond standard benchmarks by incorporating descriptive metadata on metaphor conventionality, rated on a three-level scale (lexicalised, implicit, and ad-hoc), alongside a range of dictionary-derived source domains (military, gastronomy, health, meteorology, etc.). These metadata enable deeper analysis of model performance and potentially novel insights regarding creativity, language change, and cultural sensitivity.
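As an illustration of the instance structure described above, a minimal sketch of how one benchmark item could be encoded is given below; the field names, the example sentence, and the paraphrases are hypothetical and are not drawn from the dataset itself.

```python
# Minimal sketch of a DAMETA-style instance encoding.
# Field names and example content are illustrative assumptions,
# not taken from the released dataset.

example_instance = {
    # Sentence containing the metaphorical target word (here "slugte", "devoured").
    "sentence": "Han slugte bogen på én aften.",
    "target_word": "slugte",
    # Four human-created paraphrase options for the target word.
    "options": {
        "correct": "læste den hurtigt og ivrigt",                 # correct figurative reading
        "false_literal": "indtog den gennem munden",              # concrete, literal misreading
        "false_figurative": "accepterede den ukritisk",           # wrong figurative sense
        "false_contradictory": "lagde den fra sig af kedsomhed",  # contradicts the sentence
    },
    # Descriptive metadata: conventionality on a three-level scale
    # and a dictionary-derived source domain.
    "conventionality": "lexicalised",  # one of: lexicalised, implicit, ad-hoc
    "source_domain": "gastronomy",
}

if __name__ == "__main__":
    print(example_instance["sentence"])
    for label, paraphrase in example_instance["options"].items():
        print(f"  {label}: {paraphrase}")
```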