Aurum at CRF Filling 2026: Modular DSPy Extractors with Qwen3-Max for Multilingual CRF Filling

Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026

Abstract

This paper describes the submission by Team Aurum to the CL4Health @ LREC 2026 Shared Task on Case Report Form (CRF) Filling from clinical notes of dyspnea patients. Extracting 134 structured clinical fields with a single Large Language Model (LLM) call often leads to schema-following errors, hallucination, and degraded attention to complex instructions. To address this, we propose a modular extraction pipeline built with DSPy that decomposes the 134 CRF fields into 14 specialized, domain-specific extractors (e.g., Medical History, Lab Values, Acute Diagnoses). We conducted extensive experiments across multiple multilingual LLMs, including Llama 4 Maverick, GPT-4o, GPT-4o Mini, DeepSeek-V3, Gemma-3-12B-Instruct, and Qwen-series models. Among these, Qwen3-Max (Thinking) with our optimized v2 prompts achieved the best performance on the development set, with a Macro-F1 of 0.70, outperforming other evaluated models such as GPT-4o (0.68) and DeepSeek-V3 (0.66). Prompt optimization yielded a measurable gain, improving Qwen3-Max from 0.67 to 0.70 Macro-F1. Using this configuration, our pipeline achieved an official Codabench test Macro-F1 of 0.68 in English and 0.67 in Italian, ranking first overall in the shared task.
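
The decomposition idea can be illustrated with a minimal, library-agnostic sketch. It is not the DSPy implementation used in the submission: the field names, the three example groups, the "unknown" default, and the keyword-based stand-in for the LLM call are all hypothetical and chosen only to keep the example runnable. Each group of fields is extracted by a separate call, and only the fields requested for that group are kept, so keys outside the schema are discarded before the partial results are merged into one CRF.

```python
from typing import Callable, Dict, List

# Hypothetical field groups for illustration only; the paper's pipeline
# uses 14 groups covering 134 fields.
FIELD_GROUPS: Dict[str, List[str]] = {
    "medical_history": ["history_of_copd", "history_of_heart_failure"],
    "lab_values": ["troponin", "d_dimer"],
    "acute_diagnoses": ["pneumonia", "pulmonary_embolism"],
}

# An extractor takes a note and a list of field names and returns a
# partial mapping of field name to value. In practice this would wrap
# an LLM call; here it is just a callable.
Extractor = Callable[[str, List[str]], Dict[str, str]]


def fill_crf(
    note: str,
    extractor: Extractor,
    groups: Dict[str, List[str]] = FIELD_GROUPS,
    default: str = "unknown",  # assumed placeholder for unfilled fields
) -> Dict[str, str]:
    """Run one extraction per field group and merge the results."""
    crf: Dict[str, str] = {}
    for fields in groups.values():
        raw = extractor(note, fields)
        # Keep only the fields this group asked for; anything else the
        # extractor returned is dropped, and missing fields get the default.
        for field in fields:
            crf[field] = raw.get(field, default)
    return crf


def keyword_extractor(note: str, fields: List[str]) -> Dict[str, str]:
    """Toy stand-in for an LLM: marks a field 'y' if its name appears in the note."""
    text = note.lower()
    return {f: "y" for f in fields if f.replace("_", " ") in text}


if __name__ == "__main__":
    example = "Patient with pneumonia; troponin elevated."
    print(fill_crf(example, keyword_extractor))
```

Because every group is handled by its own call, each prompt only has to describe a small slice of the schema, which is the motivation the abstract gives for splitting the 134 fields across specialized extractors.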