MedGore: An Approach and a Dataset for Identification of Sensitive Medical Images

Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026

DOI:10.63317/5f7uiuto4ph5

Abstract

Medical images are invaluable for illustrating health issues to patients. While biomedical publications are a good source of such images, some of the images are not appropriate for patient viewing without a warning. To enable the development of automated tools for selecting patient-safe images and generating warnings, we created MedGore, a dataset of over 78,000 sensitive medical images and 183,000 non-sensitive images published in the biomedical literature. The sensitive content includes gore, severe disease, nudity, surgical openings, internal organs, and other medical images of this nature. A seed set of 300 manually identified images was expanded using a combination of human curation and a nearest-neighbor clustering algorithm. The quality of the automatically labeled images was evaluated manually, yielding a total of more than 4,000 doubly manually annotated images. In experiments that validated the dataset on the task of labeling unseen images using image features, figure captions, or both, the automatically labeled images approached the utility of the manually labeled images for training models.
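
The seed expansion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the image features, the similarity measure, the threshold, or how curation was interleaved with the search, so everything below is an assumption made for illustration: images are represented by hypothetical feature vectors, similarity is cosine, and the `approve` callback stands in for a human curator. The function names `cosine`, `propose_candidates`, and `expand_seed_set` are invented here and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propose_candidates(labeled_vectors, pool, threshold=0.9):
    """Return (id, similarity) pairs for pool items whose nearest labeled
    vector is at least `threshold` similar, most similar first.

    The threshold value is an arbitrary illustrative choice.
    """
    proposals = []
    for item_id, vec in pool.items():
        best = max(cosine(vec, s) for s in labeled_vectors)
        if best >= threshold:
            proposals.append((item_id, best))
    proposals.sort(key=lambda p: p[1], reverse=True)
    return proposals


def expand_seed_set(seeds, unlabeled, approve, threshold=0.9, max_rounds=3):
    """Grow a non-empty labeled set from `seeds` (id -> vector).

    Each round proposes near neighbors of the current labeled set and keeps
    only those accepted by `approve`, a stand-in for human curation.
    """
    labeled = dict(seeds)
    pool = dict(unlabeled)
    for _ in range(max_rounds):
        proposals = propose_candidates(list(labeled.values()), pool, threshold)
        accepted = [item_id for item_id, _ in proposals if approve(item_id)]
        if not accepted:
            break  # nothing new was accepted, so further rounds cannot help
        for item_id in accepted:
            labeled[item_id] = pool.pop(item_id)
    return labeled


# Toy usage with 2-D vectors; real image features would be much larger.
seeds = {"s1": [1.0, 0.0]}
unlabeled = {"a": [0.99, 0.1], "b": [0.0, 1.0], "c": [0.9, 0.3]}
expanded = expand_seed_set(seeds, unlabeled, approve=lambda item_id: True)
```

In this toy run the items `a` and `c` lie close to the seed and join the labeled set, while `b` does not. Keeping a curator in the loop at every round is one way to limit drift, since each accepted item becomes a new anchor for later neighbor searches.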

Details

Paper ID
lrec2026-ws-cl4health-26
Pages
pp. 296-306
BibKey
gayen-etal-2026-medgore
Editors
Deepak Gupta, Paul Thompson, Sophia Ananiadou, Dina Demner-Fushman
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Soumya Gayen
  • Rory Mulcahey
  • Russell Loane
  • Dina Demner-Fushman
  • Deepak Gupta
