LREC 2026 Workshop

From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection

Proceedings of the Workshop Neology and Large Language Models

DOI:10.63317/4o6ks86o293r

Abstract

We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (NEOLOGISM, ENTITY, FOREIGN, NONE). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005–2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate, and their labels are combined by majority vote and passed through a final verification step; the results reveal substantial cross-model disagreement and highlight the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations.
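
The voting step and the headline figures in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `majority_vote` and `reduction_rate` are hypothetical, and the strict-majority rule (returning `None` when no label wins more than half the votes, so the candidate can be sent on to verification) is an assumption, since the abstract does not say how ties are resolved. Only the four class labels and the counts come from the abstract.

```python
from collections import Counter
from typing import Optional, Sequence

# The four classes named in the abstract.
LABELS = ("NEOLOGISM", "ENTITY", "FOREIGN", "NONE")


def majority_vote(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of models.

    Returns None when no label has more than half of the votes.
    Treating such cases as "needs verification" is an assumption
    of this sketch, not a rule stated in the abstract.
    """
    for vote in votes:
        if vote not in LABELS:
            raise ValueError(f"unknown label: {vote!r}")
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def reduction_rate(total: int, kept: int) -> float:
    """Fraction of items removed when `total` items are cut down to `kept`."""
    return 1 - kept / total


# Figures reported in the abstract.
UNIQUE_TOKENS = 124_600_000
CANDIDATES = 1_021
CONFIRMED = 599

reduction = reduction_rate(UNIQUE_TOKENS, CANDIDATES)
confirmed_share = round(100 * CONFIRMED / CANDIDATES, 1)
```

Under these assumptions, three hypothetical model votes of NEOLOGISM, NEOLOGISM and ENTITY resolve to NEOLOGISM, while three different labels resolve to `None`. The two derived values reproduce the reported reduction of over 99.99% and the 58.7% confirmation rate.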

Details

Paper ID
lrec2026-ws-neollm-01
Pages
pp. 1-15
BibKey
rossini-etal-2026-124
Editors
Giedre Valunaite Oleskeviciene, Voula Giouli, Florentina Armaselu, Chaya Liebeskind, Barbara McGillivray
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Workshop Neology and Large Language Models
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Diego Rossini
  • Lonneke van der Plas

Links