ViKhoMT: A Vietnamese–K'Ho Neural Machine Translation Dataset and Evaluation for Community Health Communication
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
The Vietnamese government is prioritizing the socio-economic development and societal integration of ethnic minorities, including the K’Ho people. However, the lack of digital resources creates significant communication barriers, particularly in the critical domain of community health. To address this gap, we introduce ViKhoMT, a new, professionally curated Vietnamese–K’Ho parallel dataset containing approximately 10,000 sentence pairs focused on community health communication. To demonstrate the dataset’s quality and establish performance benchmarks, we fine-tune and evaluate several pre-trained Neural Machine Translation (NMT) models. In our experiments, a system based on the M2M100 architecture achieves BLEU scores of 60.5 for K’Ho-to-Vietnamese and 56.4 for Vietnamese-to-K’Ho. We release the dataset freely to the research community to support future studies and the development of practical translation tools for the K’Ho community. The dataset is publicly available at https://github.com/NgocTram2711/ViKhoMT.
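For readers unfamiliar with the metric reported above, the following is a minimal sketch of corpus-level BLEU: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It is illustrative only. The abstract does not state which BLEU implementation, tokenization, or smoothing was used, so the whitespace tokenization, the single reference per sentence, the absence of smoothing, and the function names `ngrams` and `corpus_bleu` are all assumptions of this sketch, and its output should not be expected to reproduce the reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions of this sketch: whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any order has no match

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize system output that is shorter than the reference overall.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Toy placeholder tokens, not drawn from the dataset.
score = corpus_bleu(["a b c d e"], ["a b c d f"])
```

Because the statistics are pooled over the whole test set before the precisions are computed, this score is generally not equal to the average of per-sentence BLEU scores.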