
ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/4qdrjbdsv9zt

Abstract

Pre-trained language models have become crucial to achieving competitive results across many Natural Language Processing (NLP) problems. The number of monolingual pre-trained models for low-resource languages has increased significantly. However, most of them target the general domain, and strong baseline language models for specific domains remain limited. We introduce ViHealthBERT, the first domain-specific pre-trained language model for Vietnamese healthcare. Our model shows strong results, outperforming general-domain language models on all health-related datasets. Moreover, we present Vietnamese healthcare-domain datasets for two tasks: Acronym Disambiguation (AD) and Frequently Asked Questions (FAQ) Summarization. We release ViHealthBERT to facilitate future research and downstream applications for domain-specific Vietnamese NLP. Our dataset and code are available at https://github.com/demdecuong/vihealthbert.

Details

Paper ID
lrec2022-main-035
Pages
pp. 328-337
BibKey
minh-etal-2022-vihealthbert
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20–25 June 2022

Authors

  • Nguyen Minh
  • Vu Hoang Tran
  • Vu Hoang
  • Huy Duc Ta
  • Trung Huu Bui
  • Steven Quoc Hung Truong
