SloCal-Net at MEDIQA-Eval 2026: Investigating the Impact of Reasoning and External Context on Medical Answer Grading

Proceedings of the 8th Workshop on Clinical Natural Language Processing (Clinical NLP) @ LREC 2026

Abstract

Automated evaluation of multimodal medical answers is essential for scalable safety assessment, yet aligning automatic scores with expert judgment across languages and image modalities remains difficult. We describe SloCal-Net’s systems for the MEDIQA-Eval 2026 shared task, which frame evaluation as rubric-conditioned multimodal judging: the judge receives the question, image(s), candidate answer, and task-specific criteria, and outputs criterion-level scores and an overall rating. We seeded evidence retrieval with ChatGPT Deep Research, which produced a 25-document clinical corpus used for lightweight retrieval-augmented grounding. On the official leaderboard, our best submission (GPT-5-mini with web search and RAG) achieved Pearson correlations with expert ratings of 0.466 on English and 0.260 on Chinese. In post-competition experiments with open-source judges, English Pearson correlations peaked at 0.272 with GLM-4.6V, followed by 0.212 with Qwen3-VL-30B-Thinking; Chinese correlations were lower, highlighting remaining gaps in multilingual calibration and image–text grounding.
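As an illustration of the metric reported above, the following minimal sketch computes a Pearson correlation between a judge's overall ratings and expert ratings. The function name `pearson` and the toy score lists are assumptions made for this example; they are not the shared task's official scorer or data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical overall ratings for four candidate answers (illustrative only).
judge_scores = [0.2, 0.5, 0.9, 0.4]
expert_scores = [0.0, 0.5, 1.0, 0.5]
correlation = pearson(judge_scores, expert_scores)
```

Because the coefficient is invariant to shifting and positive rescaling of either list, it measures how well the judge tracks the experts' relative ordering and spacing of answers rather than whether the two use the same absolute scale.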