A Cheap Lunch: Synthetic Annotation With Reduced Human Effort for Medical Text Mining
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Electronic Health Records are rich sources of patient information, including knowledge about the functioning of patients as defined in the International Classification of Functioning (ICF) by the WHO. However, patient notes remain largely unexplored because this knowledge is packaged in the sometimes cryptic language exchanged between caretakers. Recent research has started to use NLP techniques to extract this knowledge, but these techniques often require laborious annotation. In this paper, we report on how the annotation can (partly) be done by a generative LLM, both for ICF categories that were previously annotated manually and for new ICF categories for which no annotation existed. We show that a domain-specific encoder finetuned on both manual and synthetic annotations outperforms one finetuned on the manual annotations alone, as measured on a dedicated test set that was adapted to the new categories with minimal manual effort. We also assess the quality of the synthetic annotations of the training data. Our process shows how competitive text classifiers for medical text mining can be developed and extended to new categories with minimal manual effort by experts.