Extracting Medication Instructions from Dutch General Practice Electronic Health Records with Local Natural Language Processing
Proceedings of the 8th Workshop on Clinical Natural Language Processing (Clinical NLP) @ LREC 2026
Abstract
Extracting structured medication prescription data from unstructured clinical text remains a critical challenge for clinical research and data standardization. This study investigates the application of Natural Language Processing (NLP) techniques to Dutch electronic health records (EHRs) from the Julius General Practitioners Network. The goal is to automatically extract key prescription attributes, including dosage, duration, and medication unit, and to prepare them for integration into the ConcePTION Common Data Model in support of scalable pharmacoepidemiological research. We compare a lightweight rule-based system with transformer-based models (RobBERT and MedRoBERTa) under the technical constraints of a Trusted Research Environment, where external resources and cloud-based solutions are restricted. Using a dataset of 1,819 manually annotated records, we evaluate the approaches on predictive performance and computational cost. Results show that the rule-based system achieves strong accuracy at low computational cost on structured patterns, while the transformer-based models are more robust to linguistic variability. However, both approaches struggle with ambiguous dosage formats and long treatment durations. Our findings indicate that NLP methods can substantially improve the structuring of Dutch prescription data for such research. Future work should focus on improving generalization and expanding the annotated dataset to enhance model reliability.
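To make the rule-based setting concrete, the following is a minimal sketch of how dosage, unit, and duration might be pulled from a Dutch instruction string using regular expressions. The patterns, the `extract` function, the `Prescription` type, and the example strings are hypothetical illustrations and are not the rules or data used in this study.

```python
import re
from typing import Optional, TypedDict


class Prescription(TypedDict):
    daily_dose: Optional[float]
    unit: Optional[str]
    duration_days: Optional[int]


# Hypothetical patterns for illustration only; not the study's rule set.
# Frequency, e.g. "2 x daags", "3 dd", "1 keer per dag".
FREQ = re.compile(r"(\d+)\s*(?:x|maal|keer)?\s*(?:per dag|daags|dd)\b", re.I)
# Quantity per intake plus unit, e.g. "1 tablet", "2,5 ml".
QTY = re.compile(r"(\d+(?:[.,]\d+)?)\s*(tablet|capsule|ml|druppel|stuk)", re.I)
# Duration, e.g. "gedurende 7 dagen", "gedurende 2 weken".
DURATION = re.compile(r"gedurende\s+(\d+)\s*(dag|dagen|week|weken)\b", re.I)


def extract(text: str) -> Prescription:
    """Extract daily dose, unit, and duration in days from one instruction."""
    freq = FREQ.search(text)
    qty = QTY.search(text)
    dur = DURATION.search(text)

    daily_dose = None
    if freq and qty:
        # Dutch text often uses a decimal comma ("2,5 ml").
        per_intake = float(qty.group(1).replace(",", "."))
        daily_dose = int(freq.group(1)) * per_intake

    duration_days = None
    if dur:
        n = int(dur.group(1))
        # Convert weeks to days; day counts pass through unchanged.
        duration_days = n * 7 if dur.group(2).lower().startswith("we") else n

    return {
        "daily_dose": daily_dose,
        "unit": qty.group(2).lower() if qty else None,
        "duration_days": duration_days,
    }
```

Rules of this kind run quickly and without external resources, which suits a Trusted Research Environment, but they return empty fields for any phrasing the patterns do not anticipate (for example "zo nodig"), which is where learned models may be more robust.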