LREC 2026 (main)

Śmigiel Dataset: Laying Foundations for Investigating Machine-Generated Text Detection in Polish

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3p7ghe9pfm8v

Abstract

We present Śmigiel, the first open dataset for training and evaluating machine-generated text (MGT) detectors in Polish. The dataset includes a collection of human-written text fragments from six domains, which are used to prompt text generation by eight language models capable of producing credible Polish text. In addition to the raw corpus of over 462K generated texts, we also release a cleaned, source- and domain-balanced dataset suitable for training and evaluating MGT detectors. Finally, we conduct preliminary experiments with text classifiers, showing that task difficulty depends on the text domain, the generating language model, and the availability of similar data in training. The results indicate that MGT detection in Polish can be approached with general-purpose classifiers, which generalize well to new LLMs but struggle to adapt to genres not represented in the training data.
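The two ideas in the abstract, balancing by source and domain and testing on a genre absent from training, can be illustrated with a minimal sketch. This is not the authors' pipeline: the record fields (`source`, `domain`, `text`), the function names, and the toy data are all hypothetical, and the balancing shown is plain downsampling to the smallest (source, domain) cell, which may differ from the procedure used for the released dataset.

```python
import random
from collections import defaultdict


def balance_by_source_and_domain(records, seed=0):
    """Downsample so every (source, domain) cell holds the same number of texts.

    Hypothetical helper: assumes each record is a dict with "source" and
    "domain" keys. The target size is the size of the smallest cell.
    """
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["source"], rec["domain"])].append(rec)
    if not cells:
        return []
    target = min(len(items) for items in cells.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for key in sorted(cells):
        balanced.extend(rng.sample(cells[key], target))
    return balanced


def leave_one_domain_out(records, held_out_domain):
    """Split records so the held-out domain appears only in the test set."""
    train = [r for r in records if r["domain"] != held_out_domain]
    test = [r for r in records if r["domain"] == held_out_domain]
    return train, test


# Toy data with made-up sources and domains (not the real corpus).
toy = [
    {"source": s, "domain": d, "text": f"{s}-{d}-{i}"}
    for (s, d, n) in [
        ("human", "news", 5),
        ("human", "reviews", 3),
        ("model_a", "news", 4),
        ("model_a", "reviews", 6),
    ]
    for i in range(n)
]

balanced = balance_by_source_and_domain(toy)
train, test = leave_one_domain_out(balanced, "reviews")
```

A detector trained on `train` and scored on `test` would then be evaluated on a genre it never saw, which is the setting the abstract reports as difficult.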

Details

Paper ID
lrec2026-main-828
Pages
pp. 10556-10568
BibKey
strebeyko-etal-2026-śmigiel
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Jakub Strebeyko

  • Alina Wróblewska

  • Piotr Przybyła