DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large language models (LLMs) show strong reasoning abilities, but fully retraining them for the medical domain is often infeasible due to limited data or compute resources. We present DeepICD-R1, a framework for efficient medical reasoning fine-tuning that unites hierarchical rewards with distilled supervision. We reformulate ICD-10-CM prediction as a reinforcement learning problem and design a hierarchical outcome-based reward that reflects the ICD code structure at the chapter, category, and full-code levels. In parallel, we release a large-scale distilled dataset of over 90k reasoning traces derived from MIMIC-IV admission notes, incorporating clinical validation and official coding guidelines. Fine-tuning smaller instruction-tuned LLMs on this data with GRPO-based reinforcement learning yields consistent gains in diagnostic accuracy and reasoning coherence. Extensive ablations confirm that hierarchical supervision and verifiable outcome rewards enable competitive, domain-specialized reasoning models without additional pretraining, providing a reproducible foundation for clinical NLP research.

Keywords: Clinical NLP, Large Reasoning Model, GRPO, Supervised Fine-Tuning
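To make the hierarchical outcome-based reward concrete, the sketch below grades a predicted ICD-10-CM code against the gold code at the full-code, category, and chapter levels. The reward weights (1.0 / 0.5 / 0.25), the abbreviated chapter table, and all function names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Hedged sketch of a hierarchical outcome reward for ICD-10-CM prediction.
# The weights and the partial chapter table below are illustrative
# assumptions; the paper's actual reward definition may differ.

CHAPTER_RANGES = [  # (start_category, end_category, chapter) -- abbreviated
    ("A00", "B99", 1),   # certain infectious and parasitic diseases
    ("C00", "D49", 2),   # neoplasms
    ("E00", "E89", 4),   # endocrine, nutritional and metabolic diseases
    ("I00", "I99", 9),   # diseases of the circulatory system
]

def category(code: str) -> str:
    """First three characters of an ICD-10-CM code, e.g. 'E11.9' -> 'E11'."""
    return code.replace(".", "").upper()[:3]

def chapter(code: str):
    """Map a code's category into its ICD-10-CM chapter, if listed above."""
    cat = category(code)
    for lo, hi, ch in CHAPTER_RANGES:
        if lo <= cat <= hi:
            return ch
    return None

def hierarchical_reward(pred: str, gold: str) -> float:
    """Graded outcome reward: exact full code > category > chapter > none."""
    norm = lambda c: c.replace(".", "").upper()
    if norm(pred) == norm(gold):
        return 1.0        # full-code match
    if category(pred) == category(gold):
        return 0.5        # same three-character category
    if chapter(pred) is not None and chapter(pred) == chapter(gold):
        return 0.25       # same chapter only
    return 0.0            # no hierarchical overlap

assert hierarchical_reward("E11.9", "E11.9") == 1.0    # exact code
assert hierarchical_reward("E11.65", "E11.9") == 0.5   # same category E11
assert hierarchical_reward("E03.9", "E11.9") == 0.25   # same chapter 4
```

Grading partial matches this way gives the policy a denser learning signal than a binary exact-match reward: a prediction in the right category or chapter still earns credit, which is the intuition behind the hierarchical supervision described above.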