The German Medical Text Corpus: Early 2026 Update

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

Abstract

Clinical text resources are central to the study of medical language and to the training and evaluation of large language models, chatbots, and artificial intelligence systems that support clinical routines. With the German Medical Text Corpus (GeMTeX), we are currently building the largest shareable clinical document dataset in German. The multi-centric project ensures diversity across university hospitals, clinical domains, and text types. After a thorough de-identification process, the clinical texts are semantically annotated using SNOMED CT, a language-independent, standardized medical ontology. Although the corpus is still under active development, it is available on request under controlled access conditions. As of February 2026, GeMTeX comprises more than 15k documents and 20M tokens. Researchers interested in the resource are invited to visit https://kiinformatik.mri.tum.de/en/gemtex or to contact us at gemtex.mi@mh.tum.de.