Back to Main Conference 2014
LREC 2014main

Constructing a Chinese—Japanese Parallel Corpus from Wikipedia

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)

DOI:10.63317/5i4tcv4g6gui

Abstract

Parallel corpora are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese―Japanese. As comparable corpora are far more available, many studies have been conducted to automatically construct parallel corpora from comparable corpora. This paper presents a robust parallel sentence extraction system for constructing a Chinese―Japanese parallel corpus from Wikipedia. The system is inspired by previous studies that mainly consist of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve the system by using the common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than the previous studies for both accuracy in parallel sentence extraction and SMT performance. Using the system, we construct a Chinese―Japanese parallel corpus with more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/chu/resource/wiki_zh_ja.tgz.

Details

Paper ID
lrec2014-main-209
Pages
pp. 642-647
BibKey
chu-etal-2014-constructing
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-9517408-8-4
Conference
Ninth International Conference on Language Resources and Evaluation
Location
Reykjavik, Iceland
Date
26 May 2014 31 May 2014

Authors

  • CC

    Chenhui Chu

  • TN

    Toshiaki Nakazawa

  • SK

    Sadao Kurohashi

Links