J-ClinicalBench: A Benchmark for Evaluating Large Language Models on Practical Clinical Tasks in Japanese
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Recent advances in large language models (LLMs) have accelerated NLP applications in the medical and clinical domains. However, evaluation remains limited for non-English languages such as Japanese, where clinical corpora are particularly scarce. To address this gap, we present J-ClinicalBench, a publicly available benchmark designed to reflect realistic Japanese clinical tasks. We first created 227 expert-authored clinical documents and newly constructed five datasets for core clinical tasks. Building on these datasets, J-ClinicalBench comprises nine clinical tasks spanning clinical language reasoning, generation, and understanding. We establish baseline performance on J-ClinicalBench by evaluating state-of-the-art proprietary and Japanese open-source LLMs, providing the first assessment of their utility in practical clinical scenarios. By releasing this benchmark, we aim to foster the development and evaluation of clinically applicable LLMs in Japanese healthcare, bridging the current gap between clinical NLP research and clinical practice.