
Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/2m9ra566665h

Abstract

Chinese sequence labeling tasks are sensitive to word boundaries. Although pretrained language models (PLMs) have achieved considerable success in these tasks, current PLMs rarely consider boundary information explicitly. An exception is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT’s pre-training objectives. Building on this approach, we add supervised, high-quality boundary information to enhance BABERT’s learning, yielding a semi-supervised boundary-aware PLM. To assess PLMs’ ability to encode boundaries, we introduce a novel “Boundary Information Metric” that is both simple and effective. This metric allows different PLMs to be compared without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT outperforms the vanilla version, not only on these tasks but also on broader Chinese natural language understanding tasks. Additionally, our proposed metric offers a convenient and accurate means of evaluating PLMs’ boundary awareness.

Details

Paper ID
lrec2024-main-0282
Pages
pp. 3179-3191
BibKey
zhang-etal-2024-chinese
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 - 25 May 2024

Authors

  • Longhui Zhang
  • Dingkun Long
  • Meishan Zhang
  • Yanzhao Zhang
  • Pengjun Xie
  • Min Zhang
