
Legal Considerations in the Use of Synthetic Data for AI Development and Finetuning: The Case of LLMs4EU

Proceedings of the Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Computational Approaches to Language Data Pseudonymization, Anonymization, De-identification, and Data Privacy (LEGAL2026 and CALD-pseudo 2026) @ LREC 2026

DOI:10.63317/3vwisn8odtmp

Abstract

This paper examines the legal implications of using synthetic data to develop and fine‑tune general‑purpose AI models in the European Union, using the LLMs4EU project as a case study. It situates synthetic data within the Union’s broader data policy and highlights it as a candidate tool for reconciling data availability with regulatory constraints. From a data‑protection perspective, it analyses whether and when synthetic data should be classified as "personal data" under the GDPR. From a copyright and contractual standpoint, the paper assesses the risks that synthetic datasets may embed infringing content or derive from unlawfully trained models, in light of the GEMA v. OpenAI ruling on memorised works and emerging analyses of liability for AI‑generated outputs, and considers the constraints imposed by model licensing and acceptable‑use policies on using models to generate training data for other models. The paper concludes that synthetic data can play a valuable role in mitigating legal risks and enabling compliant AI development in LLMs4EU, but only if its generation and use are embedded in robust governance frameworks that address data protection, copyright and contractual obligations across the entire data value chain.

Details

Paper ID: lrec2026-ws-legal-10
Pages: pp. 86-90
BibKey: talmoudi-etal-2026-legal
Editors: Ingo Siegert, Maria Irena Szawerna, Khalid Choukri, Simon Dobnik, Paweł Kamocki, Therese Lindström Tiedemann, Pierre Lison, Ricardo Muñoz Sánchez, Ildikó Pilán, Lisa Södergård, Kossay Talmoudi, Elena Volodina, Xuan-Son Vu
Publisher: European Language Resources Association (ELRA)
ISSN: N/A
ISBN: N/A
Workshop: Proceedings of the Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Computational Approaches to Language Data Pseudonymization, Anonymization, De-identification, and Data Privacy (LEGAL2026 and CALD-pseudo 2026) @ LREC 2026
Location: Palma, Mallorca, Spain
Date: 11-16 May 2026

Authors

  • Kossay Talmoudi
  • Khalid Choukri
  • Amélie Gourgeot
  • Florine Astruc
