LREC-COLING 2024 (main)

Humanistic Buddhism Corpus: A Challenging Domain-Specific Dataset of English Translations for Classical and Modern Chinese

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5g2srad8dizv

Abstract

We introduce the Humanistic Buddhism Corpus (HBC), a dataset containing over 80,000 Chinese-English parallel phrases extracted and translated from publications in the domain of Buddhism. HBC is one of the largest free, publicly available domain-specific datasets for research, and it contains text in both classical and modern Chinese. Moreover, since HBC originates from religious texts, many phrases in the dataset contain metaphors and symbolism and are subject to multiple interpretations. Compared to existing machine translation datasets, HBC presents unique and difficult challenges. In this paper, we describe HBC in detail and evaluate it in a machine translation setting, validating its use by establishing performance benchmarks using a Transformer model with different transfer learning setups.

Details

Paper ID
lrec2024-main-0737
Pages
pp. 8406-8417
BibKey
wong-etal-2024-humanistic
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20-25 May 2024

Authors

  • Youheng W. Wong

  • Natalie Parde

  • Erdem Koyuncu

Links