SUAT-BMI at MEDIQA-EVAL 2026: An Ensemble Approach to Language Models as Judges for Automatic Rating of Medical Responses

Proceedings of the 8th Workshop on Clinical Natural Language Processing (Clinical NLP) @ LREC 2026

DOI:10.63317/2kdt525sk8is

Abstract

The MEDIQA-EVAL 2026 shared task focuses on developing automatic evaluation metrics for LLM-generated responses in dermatology and wound care. While LLMs have shown promise as judge models, the reliability of these metrics remains underexplored. In this work, we study how well judge models can approximate human expert ratings across clinical evaluation criteria. We evaluate multiple approaches, including few-shot prompting, BERT fine-tuning, and retrieval-augmented generation (RAG), and combine them in an ensemble framework. Our method achieves a correlation score of 0.481, ranking first among 41 participating teams. Our results provide insight into the reliability of LLM-based evaluation metrics and highlight their potential for scalable clinical assessment.

Details

Paper ID: lrec2026-ws-clinicalnlp-02
Pages: pp. 12-18
BibKey: peng-etal-2026-suat
Editors: Asma Ben Abacha, Steven Bethard, Danielle Bitterman, Tristan Naumann, Kirk Roberts
Publisher: European Language Resources Association (ELRA)
ISSN: N/A
ISBN: N/A
Workshop: Proceedings of the 8th Workshop on Clinical Natural Language Processing (Clinical NLP) @ LREC 2026
Location: Palma, Mallorca, Spain
Date: 11 - 16 May 2026

Authors

  • Xinzhe Peng
  • Liyuan E
  • Kun Feng
  • Jielin Li
  • Yuxuan Tang
  • Zhao Li
