LREC-COLING 2024 (main)

WikiSplit++: Easy Data Refinement for Split and Rephrase

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/3vkuxcodebd2

Abstract

The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still suffers from hallucinations and under-splitting. To address these issues, this paper presents a simple and strong data refinement approach. Here, we create WikiSplit++ by removing instances in WikiSplit where complex sentences do not entail at least one of the simpler sentences and reversing the order of reference simple sentences. Experimental results show that training with WikiSplit++ leads to better performance than training with WikiSplit, even with fewer training instances. In particular, our approach yields significant gains in the number of splits and the entailment ratio, a proxy for measuring hallucinations.
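The refinement described in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the `refine` function, the `Instance` type, and the `entails` callback are hypothetical names, and `toy_entails` is a word-overlap stand-in for whatever entailment classifier is actually used. It also assumes the filter keeps an instance only when the complex sentence entails every reference simple sentence.

```python
from typing import Callable, Iterable, List, Tuple

# (complex sentence, reference simple sentences) -- hypothetical layout
Instance = Tuple[str, List[str]]


def refine(
    instances: Iterable[Instance],
    entails: Callable[[str, str], bool],
) -> List[Instance]:
    """Drop instances with a non-entailed simple sentence, then reverse
    the order of the reference simple sentences in the kept instances."""
    refined: List[Instance] = []
    for complex_sentence, simple_sentences in instances:
        # Assumption: every simple sentence must be entailed to keep the pair.
        if not all(entails(complex_sentence, s) for s in simple_sentences):
            continue
        refined.append((complex_sentence, list(reversed(simple_sentences))))
    return refined


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model: true if every hypothesis word
    also appears in the premise. For demonstration only."""
    def words(text: str) -> set:
        return set(text.lower().replace(".", "").replace(",", "").split())

    return words(hypothesis) <= words(premise)


complex_sentence = "Alice wrote the book, and Bob edited the book."
data = [
    (complex_sentence, ["Alice wrote the book.", "Bob edited the book."]),
    # The second reference adds unsupported content ("in 1999"),
    # so this instance is filtered out.
    (complex_sentence, ["Alice wrote the book.", "Bob edited the book in 1999."]),
]
result = refine(data, toy_entails)
```

In practice the `entails` callback would wrap a trained natural language inference model; passing it in as a parameter keeps the filtering and reordering logic independent of that choice.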

Details

Paper ID
lrec2024-main-1533
Pages
pp. 17625-17636
BibKey
tsukagoshi-etal-2024-wikisplit
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20-25 May 2024

Authors

  • Hayato Tsukagoshi
  • Tsutomu Hirao
  • Makoto Morishita
  • Katsuki Chousa
  • Ryohei Sasano
  • Koichi Takeda
