Request Correction

Use this form to request corrections to the paper metadata. Select the fields that need correction and provide the correct information.

Correction Guidelines

Click the edit button next to a field to report a correction.
Fill in the suggested correction value for each field you want to correct.
Provide your name and email so we can contact you if needed.

Paper Information

lrec2026-main-817

Building Effective Japanese Medical LLMs with an Open Recipe for Domain Adaptation through Continued Pre-training

Paper Fields

Title

Building Effective Japanese Medical LLMs with an Open Recipe for Domain Adaptation through Continued Pre-training

Abstract

In high-stakes domains such as medicine, ensuring transparency of the training corpus is essential, with careful consideration of local healthcare landscapes; however, the majority of existing medical large language models (LLMs) have not disclosed the details of their training corpora. Here, we introduce an open recipe for domain adaptation of LLMs to the Japanese medical domain. We employed fully open-source Japanese general-domain LLMs as base models, whose pre-training datasets are also disclosed. To establish effective corpora for domain adaptation through continued pre-training, we started with small-scale medical datasets and ultimately constructed a medical corpus consisting of 79.6B tokens, incorporating local clinical guidelines, medical textbooks, and other domain-specific resources. The resulting LLM from continued pre-training, namely SIP-med-llm-8x13B, with an active parameter count of 22B, demonstrated favorable accuracy on benchmarks including the Japanese National Medical Examination. This performance was comparable to that of 70B-parameter open-weight models whose construction details remain non-transparent. This represents the first case in the Japanese medical field where complete corpus details have been disclosed for fully from-scratch development, providing important insights for future efforts to construct medical LLMs tailored to the specific characteristics of local contexts. The model is available publicly at this Hugging Face repository: https://huggingface.co/SIP-med-LLM/SIP-jmed-llm-2-8x13b-OP-instruct.

Authors

Expand an author to correct their information. Use the remove button to request author removal, or add a new author.

PDF Attachment

You may attach a PDF as a corrected version of the paper. Max file size: 10MB. Only PDF files are accepted.

Your Information

Name

Comment

Author Declaration *

I declare that I have notified all co-authors of the proposed corrections and obtained their consent, and that all modifications adhere to research ethics standards and the LREC correction policy.
