LREC-COLING 2024, Main Conference

A Dataset for Named Entity Recognition and Entity Linking in Chinese Historical Newspapers

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/2bu8rpadfdod

Abstract

In this study, we present a novel historical Chinese dataset for named entity recognition, entity linking, coreference and entity relations. We use data from Chinese newspapers from 1872 to 1949 and multilingual bibliographic resources from the same period. The period and the language are the main strength of the present work, offering a resource which covers different styles and language uses, as well as the largest historical Chinese NER dataset with manual annotations from this transitional period. After detailing the selection and annotation process, we present the very first results that can be obtained from this dataset. Texts and annotations are freely downloadable from the GitHub repository.

Details

Paper ID: lrec2024-main-0035
Pages: pp. 385-394
BibKey: blouin-etal-2024-dataset
Editor: N/A
Publisher: European Language Resources Association (ELRA) and ICCL
ISSN: 2522-2686
ISBN: 979-10-95546-34-4
Conference: Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location: Turin, Italy
Date: 20 May 2024 to 25 May 2024

Authors

  • Baptiste Blouin
  • Cécile Armand
  • Christian Henriot

Links