
Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3iqgydmfqiju

Abstract

Demonstrating that large language models have memorized copyrighted material is more feasible for high-volume publishers than for smaller outlets whose content appears less frequently online. This study explores how even short, repeated sequences, rather than full articles, can serve as evidence of memorization. Focusing on Dutch news sources included in the mC4 dataset, we test whether GPT-4 and mT5 reproduce excerpts from thousands of articles, including standardized editorial boilerplate. By comparing results to a post-training baseline and modeling memorization as a survival process, we find that repeated, publication-specific phrases are significantly more likely to be completed verbatim. The approach provides a means to detect empirical evidence of memorization in cases where full reproduction is unlikely.

Details

Paper ID
lrec2026-main-473
Pages
pp. 5960-5969
BibKey
veerbeek-etal-2026-memorization
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Joris Veerbeek

  • Kas Berendsen

  • Alessandra Polimeno

  • Antal van den Bosch

Links