Improving Public Health Safety in Low-Resource Languages Using a Human-Verified Health Misinformation Corpus and Large Language Models
Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)
Abstract
The proliferation of health misinformation in Low-Resource Languages (LRLs) poses a severe threat to public health, yet automated detection remains critically under-studied due to the scarcity of high-quality benchmarks. We address this gap by introducing Nep-Health-Misinfo, a novel human-verified corpus for health misinformation identification in Nepali. The dataset was developed by adapting four foundational benchmarks (Monkeypox-V1, Monkeypox-V2, COVID-19, and CoAID) through a systematic Machine Translation Post-Editing (MTPE) protocol involving native experts. Our evaluation of Neural Machine Translation (NMT) systems reveals a significant translation asymmetry: while state-of-the-art (SOTA) systems achieve a BLEU score of 43.21 on factual health data, performance degrades sharply on deceptive narratives, with BLEU falling to 19.11 and TER reaching 62.42. To establish robust baselines, we benchmark seven recent open-weight Large Language Models (LLMs), including Qwen2.5-7B-Instruct, Gemma-3-4B-IT, and Ministral-8B-Instruct, in zero-shot and few-shot settings. For the few-shot evaluation, we compare stochastic sampling against a K-means centroid-based approach that selects semantically representative exemplars. Experimental results indicate that Qwen2.5-7B-Instruct achieves a peak macro F1-score of 0.8488 in the few-shot setting, improving over its zero-shot performance (0.7188) on the same dataset. Our findings demonstrate that while few-shot prompting effectively mitigates distribution shifts in low-resource medical contexts, performance remains highly sensitive to the semantic density of exemplars. This work provides the first human-verified Nepali health misinformation corpus. All code and resources are available at https://github.com/SUJAL390/Nep-Health-Misinfo-CHIPSAL-LREC.
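As an illustration of the two few-shot exemplar selection strategies compared above, the following is a minimal sketch and not the authors' implementation (which is in the linked repository). It assumes that each candidate exemplar has already been mapped to an embedding vector; the 2-D toy vectors, the function names, the Euclidean distance, and the deterministic farthest-first seeding are all hypothetical choices made only to keep the example self-contained and reproducible.

```python
import random
from math import dist


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from all seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, max_iters=100):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    centroids = farthest_first_init(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: centroids already match this assignment
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


def select_centroid_exemplars(embeddings, k):
    """Pick, for each cluster, the index of the item closest to its centroid."""
    centroids, assignment = kmeans(embeddings, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: dist(embeddings[i], centroids[c])))
    return sorted(chosen)


def select_random_exemplars(n_items, k, seed=0):
    """Baseline: sample k distinct exemplar indices uniformly at random."""
    return sorted(random.Random(seed).sample(range(n_items), k))


# Hypothetical 2-D "embeddings": three loose groups of three items each.
toy_embeddings = [
    (0.0, 0.0), (0.4, 0.0), (0.1, 0.1),
    (5.0, 5.0), (5.2, 5.1), (4.9, 5.3),
    (10.0, 0.0), (10.1, 0.2), (9.8, 0.1),
]

centroid_picks = select_centroid_exemplars(toy_embeddings, k=3)
random_picks = select_random_exemplars(len(toy_embeddings), k=3)
print("centroid-based:", centroid_picks)
print("random:", random_picks)
```

On this toy pool the centroid-based picks cover each group once, whereas random sampling gives no such guarantee. In practice the embeddings would come from a sentence encoder and the selected indices would be used to fill the few-shot prompt; both steps are outside this sketch.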