
The German Medical Text Corpus: Early 2026 Update

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

DOI:10.63317/3xopdv4wdd93

Abstract

Clinical text resources are central to the study of medical language and to the training and evaluation of large language models, chatbots, and artificial intelligence systems that support clinical routines. With the German Medical Text Corpus (GeMTeX), we are currently building the largest shareable clinical document dataset in German. The multi-centric project ensures diversity across university hospitals, clinical domains, and text types. After a thorough de-identification process, the clinical texts are semantically annotated using SNOMED CT, a language-independent, standardized medical ontology. While the corpus is still under active development, it is available on request under controlled access conditions. As of February 2026, GeMTeX comprises more than 15k documents and 20M tokens. Researchers interested in the resource can visit https://kiinformatik.mri.tum.de/en/gemtex or contact us at gemtex.mi@mh.tum.de.

Details

Paper ID
lrec2026-ws-cmlc-16
Pages
pp. 98-100
BibKey
hofenbitzer-etal-2026-german
Editors
Piotr Bański, Dawn Knight, Marc Kupietz, Andreas Witt, Alina Wróblewska
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Justin Hofenbitzer
  • Christina Lohr
  • Frank Meineke
  • Markus Löffler
  • Martin Boeker
