Back to Main Conference 2022
LREC 2022main

Generating Artificial Texts as Substitution or Complement of Training Data

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/5n3kmjij5xgy

Abstract

The quality of artificially generated texts has considerably improved with the advent of transformers. The question of using these models to generate learning data for supervised learning tasks naturally arises, especially when the original language resource cannot be distributed, or when it is small. In this article, this question is explored under 3 aspects: (i) are artificial data an efficient complement? (ii) can they replace the original data when those are not available or cannot be distributed for confidentiality reasons? (iii) can they improve the explainability of classifiers? Different experiments are carried out on classification tasks - namely sentiment analysis on product reviews and Fake News detection - using artificially generated data by fine-tuned GPT-2 models. The results show that such artificial data can be used in a certain extend but require pre-processing to significantly improve performance. We also show that bag-of-words approaches benefit the most from such data augmentation.

Details

Paper ID
lrec2022-main-453
Pages
pp. 4260-4269
BibKey
claveau-etal-2022-generating
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20 June 2022 25 June 2022

Authors

  • VC

    Vincent Claveau

  • AC

    Antoine Chaffin

  • EK

    Ewa Kijak

Links