ViKhoMT: A Vietnamese–K'Ho Neural Machine Translation Dataset and Evaluation for Community Health Communication

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4tvv9uk7fqgn

Abstract

The Vietnamese government is prioritizing the socio-economic development and societal integration of ethnic minorities, including the K’Ho people. However, the lack of digital resources creates significant communication barriers, particularly in the critical domain of community health. To address this gap, we introduce ViKhoMT, a new, professionally curated Vietnamese-K’Ho parallel dataset containing approximately 10,000 sentence pairs focused on community health communication. To demonstrate the dataset’s quality and establish performance benchmarks, we conducted comprehensive evaluations by fine-tuning several pre-trained Neural Machine Translation (NMT) models. Our experiments show that a system based on the M2M100 architecture achieves BLEU scores of 60.5 for K’Ho-to-Vietnamese and 56.4 for Vietnamese-to-K’Ho. We release the dataset free of charge for research use, to support future studies and the development of practical translation tools for the K’Ho community. The dataset is publicly available at https://github.com/NgocTram2711/ViKhoMT.

Details

Paper ID
lrec2026-main-687
Pages
pp. 8725-8739
BibKey
truong-etal-2026-vikhomt
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 - 16 May 2026

Authors

  • Tram Truong
  • Vinh Nguyen
  • Dang Van Thin
  • Ngan Nguyen

Links