To Err Is Human, How about Medical Large Language Models? Comparing Pre-trained Language Models for Medical Assessment Errors and Reliability

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/3a7qzb2t9ysm

Abstract

Unpredictability, especially when its error characteristics are unknown, is a highly undesirable trait, particularly in patient care applications. Although large pre-trained language models (LLMs) have been applied to a variety of unseen tasks with highly competitive results, their sensitivity to language inputs and the resulting performance variability are not well studied. In this work, we test state-of-the-art pre-trained language models from a variety of families to characterize their error generation and reliability in medical assessment. In particular, we experiment with general medical assessment multiple-choice tests, as well as their open-ended and true-false alternatives. We also profile model consistency and the agreement of model errors with one another and with humans, and finally quantify the models' ability to recover from and explain their errors. The findings of this work can give modelers further information about medical models, so that they can make better-informed decisions rather than relying on standalone performance metrics.