
Evaluating Pretraining Strategies for Clinical BERT Models

Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)

DOI: 10.63317/4c8grc539id3

Abstract

Research suggests that using generic language models in specialized domains may be sub-optimal due to significant domain differences. As a result, various strategies for developing domain-specific language models have been proposed, including techniques for adapting an existing generic language model to the target domain, e.g. through various forms of vocabulary modification and continued domain-adaptive pretraining with in-domain data. Here, an empirical investigation is carried out in which various strategies for adapting a generic language model to the clinical domain are compared to pretraining a pure clinical language model. Three clinical language models for Swedish, pretrained for up to ten epochs, are fine-tuned and evaluated on several downstream tasks in the clinical domain, and their downstream performance is compared over the pretraining epochs. The results show that the domain-specific language models outperform a general-domain language model, although there is little difference in performance among the various clinical language models. However, compared to pretraining a pure clinical language model with only in-domain data, leveraging and adapting an existing general-domain language model requires fewer epochs of pretraining with in-domain data.
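One of the adaptation strategies the abstract mentions, vocabulary modification, typically amounts to extending a general-domain tokenizer's vocabulary with frequent in-domain terms before continued pretraining. The minimal sketch below illustrates the selection step only; the corpus, vocabulary, and function name are hypothetical and not taken from the paper.

```python
from collections import Counter

def select_new_vocab(corpus_tokens, base_vocab, top_k):
    """Pick the most frequent in-domain tokens absent from the base vocabulary."""
    counts = Counter(t for t in corpus_tokens if t not in base_vocab)
    return [tok for tok, _ in counts.most_common(top_k)]

# Toy clinical-style corpus and base vocabulary (hypothetical, for illustration)
corpus = ["patient", "dyspnea", "dyspnea", "ekg", "patient", "ekg", "ekg", "tachycardia"]
base_vocab = {"patient", "the", "and"}

print(select_new_vocab(corpus, base_vocab, 2))  # → ['ekg', 'dyspnea']
```

In practice the selected tokens would be added to the tokenizer and the model's embedding matrix resized accordingly before domain-adaptive pretraining continues on the in-domain corpus.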

Details

Paper ID
lrec2022-main-043
Pages
pp. 410-416
BibKey
lamproudis-etal-2022-evaluating
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
979-10-95546-38-2
Conference
Thirteenth Language Resources and Evaluation Conference
Location
Marseille, France
Date
20–25 June 2022

Authors

  • Anastasios Lamproudis

  • Aron Henriksson

  • Hercules Dalianis
