LREC 2022 (main conference)

KC4MT: A High-Quality Corpus for Multilingual Machine Translation

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/2nanfvdk345p

Abstract

Multilingual parallel corpora are an important resource for many natural language processing (NLP) applications. In machine translation, the size and quality of the training corpus largely determine the quality of the translation models. In this work, we present a method for building a high-quality multilingual parallel corpus in the news domain for several low-resource languages, including Vietnamese, Lao, and Khmer, in order to improve the quality of multilingual machine translation for these languages. We also publicly release the corpus, which includes 500,000 Vietnamese-Chinese sentence pairs, 150,000 Vietnamese-Lao sentence pairs, and 150,000 Vietnamese-Khmer sentence pairs.

Details

Paper ID
lrec2022-main-588
Pages
pp. 5494-5502
BibKey
nguyen-etal-2022-kc4mt
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20-25 June 2022

Authors

  • Vinh Van Nguyen
  • Ha Nguyen
  • Huong Thanh Le
  • Thai Phuong Nguyen
  • Tan Van Bui
  • Luan Nghia Pham
  • Anh Tuan Phan
  • Cong Hoang-Minh Nguyen
  • Viet Hong Tran
  • Anh Huu Tran
