How I Met Your Snowclone: Unsupervised Discovery of Snowclone Patterns in Large Datasets
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Snowclones are a type of Multiword Expression (MWE) pattern that includes open slots, i.e., positions that can be filled with various words. For example, in the phrase "May the X be with you," the slot X can be replaced with virtually any noun. A key feature of snowclones is that the original MWE remains recognizable and carries its meaning into the new form. However, previous work has not shown whether such substitutions are limited to fixed positions. In practice, variations such as "May the force bee with you" are also possible. In this paper, we propose using Locality-Sensitive Hashing (LSH) to automatically extract snowclone patterns from the non-commercial IMDb dataset. This process results in the FROST lexicon, which comprises 29,011 pattern candidates and 991,626 snowclone candidates across 29 languages. We then annotate 1,500 discovered patterns and 1,000 snowclones from the FROST lexicon to assess its quality. Our findings suggest that (i) most substitutions in snowclones occur at consistent positions and (ii) snowclones can be reliably discovered at scale using LSH and similarity-based metrics. This work provides the first large-scale lexicon of snowclone-based MWEs and a method that can support future research on MWE and snowclone discovery.
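To make the general idea concrete, the following is a minimal sketch of how LSH can group near-identical titles and expose a slot. It is not the pipeline used to build FROST: the abstract does not specify the LSH variant, so MinHash over word tokens with banding is an assumption, and the function names (`minhash_signature`, `lsh_candidate_pairs`, `slot_pattern`), the number of hash functions, and the number of bands are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(title):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return title.lower().split()


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(token_set, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return [min(_hash(seed, t) for t in token_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(titles, num_hashes=32, bands=16):
    """Return index pairs whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, title in enumerate(titles):
        token_set = set(tokens(title))
        if not token_set:
            continue  # an empty title has no signature
        sig = minhash_signature(token_set, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def slot_pattern(a, b):
    """Mark differing positions with X; only equal-length titles are aligned."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return None
    return " ".join(x if x == y else "X" for x, y in zip(ta, tb))


titles = [
    "May the force be with you",
    "May the farce be with you",
    "May the force bee with you",
]
for i, j in sorted(lsh_candidate_pairs(titles)):
    print(titles[i], "|", titles[j], "->", slot_pattern(titles[i], titles[j]))
```

In this sketch, two titles become a candidate pair with probability 1 - (1 - s^r)^b, where s is their Jaccard similarity over tokens, r is the number of rows per band, and b is the number of bands. Using more bands with fewer rows raises recall at the cost of more spurious candidates, which a similarity-based filter would then have to remove. The position-wise alignment in `slot_pattern` is deliberately naive and only handles titles of equal length.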