
ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI:10.63317/4qdrjbdsv9zt

Abstract

Pre-trained language models have become crucial to achieving competitive results across many Natural Language Processing (NLP) problems. The number of monolingual pre-trained models for low-resource languages has increased significantly. However, most of them target the general domain, and strong baseline language models for specific domains remain limited. We introduce ViHealthBERT, the first domain-specific pre-trained language model for Vietnamese healthcare. Our model shows strong results, outperforming general-domain language models on all health-related datasets. Moreover, we present Vietnamese healthcare-domain datasets for two tasks: Acronym Disambiguation (AD) and Frequently Asked Questions (FAQ) Summarization. We release ViHealthBERT to facilitate future research and downstream applications for domain-specific Vietnamese NLP. Our dataset and code are available at https://github.com/demdecuong/vihealthbert.

Details

Paper ID
lrec2022-main-035
Pages
pp. 328-337
BibKey
minh-etal-2022-vihealthbert
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
79-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20–25 June 2022

Authors

  • Nguyen Minh
  • Vu Hoang Tran
  • Vu Hoang
  • Huy Duc Ta
  • Trung Huu Bui
  • Steven Quoc Hung Truong
