Back to Main Conference 2024
LREC-COLING 2024main

CSSWiki: A Chinese Sentence Simplification Dataset with Linguistic and Content Operations

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/37ni46z23snp

Abstract

Sentence Simplification aims to make sentences easier to read and understand. With most effort on corpus development focused on English, the amount of annotated data is limited in Chinese. To address this need, we introduce CSSWiki, an open-source dataset for Chinese sentence simplification based on Wikipedia. This dataset contains 1.6k source sentences paired with their simplified versions. Each sentence pair is annotated with operation tags that distinguish between linguistic and content modifications. We analyze differences in annotation scheme and data statistics between CSSWiki and existing datasets. We then report baseline sentence simplification performance on CSSWiki using zero-shot and few-shot approaches with Large Language Models.

Details

Paper ID
lrec2024-main-0375
Pages
pp. 4205-4213
BibKey
liu-lee-2024-csswiki
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • FL

    Fengkai Liu

  • JL

    John S. Y. Lee

Links