MedAware at MEDIQA-EVAL 2026: Vision-Language Model Fine-Tuning with Logprob-Based Score Calibration for Medical Response Evaluation

Proceedings of the 8th Workshop on Clinical Natural Language Processing (Clinical NLP) @ LREC 2026

Abstract

We present MedAware, our MEDIQA-EVAL 2026 system for predicting human ratings of medical QA responses from text and images. We fine-tune Qwen3-VL models (4B, 8B, and 32B) with supervised fine-tuning (SFT) and study Group Relative Policy Optimization (GRPO) as an optional second stage under both LoRA and full-parameter settings. To handle severe label skew and unstable correlation metrics, we use logprob-based continuous scoring with quantile calibration, which converts token probabilities into calibrated metric scores without retraining. This reduces prediction collapse on skewed dimensions and improves metric stability in both English and Chinese. The approach follows the official reference-based shared-task setup and is designed to produce meaningful metric estimates even under extreme class imbalance. In the official shared-task submission setting (8B-LoRA SFT with discrete scoring), our system ranked 3rd on English and 1st among participants on Chinese. In separate post-competition offline re-evaluations with logprob scoring, the best tested configuration reaches 0.449 EN-ALL and 0.308 ZH-ALL. Across these experiments, SFT initialization remains critical for effective GRPO.
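To make the scoring idea concrete, the following is a minimal sketch of one plausible reading of logprob-based continuous scoring with quantile calibration. It is not the system's actual implementation. The rating tokens, their numeric values, the development-set scores, and the target label fractions are all illustrative assumptions, and the function names (`expected_score`, `fit_quantile_thresholds`, `calibrate`) are hypothetical. The sketch takes the expected rating under the model's renormalised probabilities over the rating tokens, then places cut points at quantiles of held-out continuous scores so that predicted label shares match a target distribution, with no change to the model weights.

```python
import math
from bisect import bisect_left


def expected_score(token_logprobs, label_values):
    """Return the expected rating under the renormalised rating-token distribution.

    token_logprobs: dict mapping token -> log-probability at the rating position
                    (assumed to be available from the decoder).
    label_values:   dict mapping rating token -> numeric rating (assumed label set).
    """
    tokens = [t for t in label_values if t in token_logprobs]
    if not tokens:
        raise ValueError("no rating token found among the returned logprobs")
    top = max(token_logprobs[t] for t in tokens)  # subtract the max for stability
    weights = {t: math.exp(token_logprobs[t] - top) for t in tokens}
    total = sum(weights.values())
    return sum(label_values[t] * w for t, w in weights.items()) / total


def fit_quantile_thresholds(dev_scores, label_fractions):
    """Choose cut points so predicted label shares match label_fractions.

    label_fractions lists the target share of each label, lowest label first,
    and is assumed to sum to 1. Returns len(label_fractions) - 1 thresholds.
    """
    ordered = sorted(dev_scores)
    n = len(ordered)
    thresholds, cumulative = [], 0.0
    for fraction in label_fractions[:-1]:
        cumulative += fraction
        rank = math.ceil(cumulative * n - 1e-9)  # small guard against float drift
        index = min(n - 1, max(0, rank - 1))
        thresholds.append(ordered[index])
    return thresholds


def calibrate(score, thresholds, labels):
    """Map a continuous score to a label; scores at or below a cut fall in the lower bin."""
    return labels[bisect_left(thresholds, score)]
```

Under these assumptions, the thresholds are fitted once on held-out continuous scores and then applied to test-time scores. Because only the cut points change, a skewed dimension can be recalibrated without touching the fine-tuned model, which is the property the abstract describes as "without retraining".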