Back to Main Conference 2018
LREC 2018main

Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts

Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DOI:10.63317/3eijvi7dhib3

Abstract

Although there are increasing and significant ties between China and Portuguese-speaking countries, there is not much parallel corpora in the Chinese-Portuguese language pair. Both languages are very populous, with 1.2 billion native Chinese speakers and 279 million native Portuguese speakers, the language pair, however, could be considered as low-resource in terms of available parallel corpora. In this paper, we describe our methods to curate Chinese-Portuguese parallel corpora and evaluate their quality. We extracted bilingual data from Macao government websites and proposed a hierarchical strategy to build a large parallel corpus. Experiments are conducted on existing and our corpora using both Phrased-Based Machine Translation (PBMT) and the state-of-the-art Neural Machine Translation (NMT) models. The results of this work can be used as a benchmark for future Chinese-Portuguese MT systems. The approach we used in this paper also show a good example on how to boost performance of MT systems for low-resource language pairs.

Details

Paper ID
lrec2018-main-236
Pages
N/A
BibKey
liu-etal-2018-chinese
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-00-9
Conference
Eleventh International Conference on Language Resources and Evaluation
Location
Miyazaki, Japan
Date
7 May 2018 12 May 2018

Authors

  • SL

    Siyou Liu

  • LW

    Longyue Wang

  • CL

    Chao-Hong Liu

Links