Legal Considerations in the Use of Synthetic Data for AI Development and Finetuning: The Case of LLMs4EU
Proceedings of the Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Computational Approaches to Language Data Pseudonymization, Anonymization, De-identification, and Data Privacy (LEGAL2026 and CALD-pseudo 2026) @ LREC 2026
Abstract
This paper examines the legal implications of using synthetic data to develop and fine‑tune general‑purpose AI models in the European Union, taking the LLMs4EU project as a case study. It situates synthetic data within the Union’s broader data policy and presents it as a candidate tool for reconciling data availability with regulatory constraints. From a data‑protection perspective, the paper analyses whether and when synthetic data should be classified as "personal data" under the GDPR. From a copyright and contractual standpoint, it assesses the risks that synthetic datasets may embed infringing content or derive from unlawfully trained models, in light of the GEMA v. OpenAI ruling on memorised works and emerging analyses of liability for AI‑generated outputs. It also considers the constraints that model licensing and acceptable‑use policies impose on using models to generate training data for other models. The paper concludes that synthetic data can play a valuable role in mitigating legal risks and enabling compliant AI development in LLMs4EU, but only if its generation and use are embedded in robust governance frameworks that address data protection, copyright and contractual obligations across the entire data value chain.