Decomposing Creativity: Two Small Datasets Combining Originality Ratings and Metaphor Annotations
Proceedings of Learning Non-Literal Expressions with Small Data @ LREC 2026
Abstract
We introduce MetaphOrig, a small dataset comprising two genre-specific collections of spatial descriptions for the study of linguistic creativity and Non-Literal Expressions (NLEs). The sentence-level spatial descriptions were extracted from two source corpora of German texts that differ in genre and period: literary prose from the 18th to the 20th century (KOLIMO) and factual travel reports from the 21st century (Wikivoyage). Along with the spatial descriptions, the dataset contains word-level metaphor annotations and sentence-level originality ratings, the latter obtained through crowdsourcing and from four different LLMs (GPT-5, Qwen2.5-32B-Instruct, Mistral-Small-3.2-24B-Instruct, and Llama-3.2-3B). We release both MetaphOrig subsets, including all annotations, to the community. The subsets can be used for further research on linguistic creativity or metaphor, either within one textual domain or comparatively across the two domains. We conduct an illustrative study on the dataset, treating originality as a proxy for textual creativity. In both subsets, we test whether sentence-level originality ratings correlate with the density of metaphorical expressions within each sentence. We find such a correlation only in the KOLIMO subset. A comparison of human and LLM originality ratings shows that this pattern holds for both types of ratings.
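The correlation analysis described above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: metaphor density is taken to be the share of tokens in a sentence that carry a metaphor label, and the association is measured with Spearman's rank correlation. The function names, the data layout, and the toy values are hypothetical and are not drawn from MetaphOrig.

```python
from math import sqrt
from statistics import mean


def metaphor_density(token_labels):
    """Share of tokens flagged as metaphorical (1) in one sentence (assumed definition)."""
    if not token_labels:
        return 0.0
    return sum(token_labels) / len(token_labels)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / sqrt(var_x * var_y)


# Toy, made-up sentences: word-level labels (1 = metaphorical) and a mean rating.
toy_sentences = [
    {"labels": [0, 0, 0, 0], "originality": 1.5},
    {"labels": [0, 1, 0, 0], "originality": 2.0},
    {"labels": [1, 1, 0, 0], "originality": 3.5},
    {"labels": [1, 1, 1, 0], "originality": 3.0},
]

densities = [metaphor_density(s["labels"]) for s in toy_sentences]
ratings = [s["originality"] for s in toy_sentences]
rho = spearman(densities, ratings)
print(f"Spearman's rho on toy data: {rho:.2f}")
```

Running the same computation separately on each subset, and separately for the human and the LLM ratings, would mirror the comparison reported above. A rank-based measure is chosen here only because originality ratings are ordinal; the study itself may use a different statistic.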