An Annotated German-Language Medical Text Corpus as Language Resource
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004)
Abstract
We describe the structure of a German-language corpus that contains a variety of medical text genres. Clinical documents (discharge summaries and pathology, histology, and surgery reports) are distinguished from non-clinical ones (textbook articles and consumer health care documents from a Web portal). We first introduce a medical extension of the general-language STTS tagset that accounts for features specific to the medical sublanguage encountered in these documents. We then discuss some quantitative properties of the annotations (e.g., distribution patterns of part-of-speech tags).
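
As an illustration of what a distribution pattern of part-of-speech tags might look like in practice, the following minimal Python sketch computes relative tag frequencies over a list of (token, tag) pairs. The function name `tag_distribution`, the input format, and the toy sentence are assumptions made for this example and are not taken from the corpus or its tooling; the tags shown are standard STTS tags, not the medical extension.

```python
from collections import Counter


def tag_distribution(tagged_tokens):
    """Return relative frequencies of POS tags, most frequent first.

    tagged_tokens: iterable of (token, tag) pairs (hypothetical format).
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # no tokens, so no distribution to report
    return {tag: n / total for tag, n in counts.most_common()}


# Invented toy sentence with standard STTS tags, for illustration only.
sample = [
    ("Der", "ART"),
    ("Patient", "NN"),
    ("klagt", "VVFIN"),
    ("über", "APPR"),
    ("starke", "ADJA"),
    ("Schmerzen", "NN"),
    (".", "$."),
]

if __name__ == "__main__":
    for tag, share in tag_distribution(sample).items():
        print(f"{tag}\t{share:.3f}")
```

Running the same function separately on clinical and non-clinical documents would give one simple way to compare tag distributions across genres.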