Overview of the MEDIQA-EVAL 2026 Shared Task on Evaluation Metrics in Medical Multimodal Question Answering
Evaluating clinical text generation remains challenging, as automatic metrics often correlate weakly with clinician judgments. This issue is particularly pronounced in medical multimodal question answering (MMQA), where systems must integrate visual and textual information and evaluation must capture factual accuracy, visual grounding, completeness, and overall coherence. Despite rapid progress in MMQA, there is limited consensus on clinically meaningful evaluation, and existing metrics, largely adapted from general NLG or VQA, often fail to capture domain-specific criteria. We introduce MEDIQA-EVAL 2026, a shared task on evaluation metrics for medical multimodal QA. To our knowledge, this is the first shared task focused on evaluating automatic metrics in this setting. We release a dataset of medical visual question-answer pairs annotated with multidimensional clinician judgments. Systems are evaluated by the correlation of their metric scores with expert ratings on a held-out test set. Participants explored diverse approaches, including vision-language models, retrieval-augmented judging, metric-specific classifiers, reinforcement learning, and LLM-as-a-judge frameworks. Results show that model-based evaluators achieve stronger alignment with human judgments than traditional NLG metrics, particularly on English data, while performance remains lower on Chinese, highlighting challenges in multilingual evaluation. Notably, our MEDIQA LLM-as-a-judge approach achieves strong performance across both languages.