
OntoBook: Ontology-Grounded Synthetic Textbooks for Medical Encoder Pretraining

Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26

DOI:10.63317/37ik5npnrvv6

Abstract

We present OntoBook, a method that converts medical ontology structure into pretraining signal for encoder language models. Our approach has three stages. First, random walks through ontology graphs capture hierarchical and causal relations between medical codes. Second, a large language model reformulates these walks into fluent textbook-style prose. Third, the resulting text is used to train ModernCamemBERT, a 149M-parameter French encoder, with two objectives on the same data: masked language modeling (MLM) and relation prediction between code pairs. On three French medical coding benchmarks (FRACCO, Cantemist-FR, Distemist-FR), OntoBook achieves significant improvements over MLM-only pretraining, with +2.5 micro-F1 on FRACCO and +8.0 micro-F1 on Distemist-FR. We find that alignment between objectives is necessary: misaligned training, where each task uses different data, causes a 30-point degradation. We release 1.3 million LLM-reformulated medical textbooks across three French ontologies (CIM-10, CCAM, ATC) and pretrained model checkpoints.
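The first stage, and the way a walk could feed the relation-prediction objective, can be illustrated with a minimal sketch. This is not the authors' implementation: the graph fragment, the relation names (`is_a`, `has_child`), and the helpers `random_walk` and `relation_pairs` are all assumptions made for illustration, and the abstract does not specify the walk length or sampling scheme.

```python
import random

# Hypothetical toy fragment for illustration only; not the released data.
# Each node maps to a list of (relation, neighbour) edges.
TOY_ONTOLOGY = {
    "I10": [("is_a", "I10-I15")],
    "I10-I15": [("is_a", "IX"), ("has_child", "I10")],
    "IX": [("has_child", "I10-I15")],
}


def random_walk(graph, start, length, rng):
    """Return up to `length` (head, relation, tail) steps starting at `start`."""
    walk, node = [], start
    for _ in range(length):
        edges = graph.get(node, [])
        if not edges:  # dead end: stop the walk early
            break
        relation, neighbour = rng.choice(edges)  # uniform choice (an assumption)
        walk.append((node, relation, neighbour))
        node = neighbour
    return walk


def relation_pairs(walk):
    """Turn a walk into (code_a, code_b, relation) examples for relation prediction."""
    return [(head, tail, relation) for head, relation, tail in walk]


rng = random.Random(0)  # seeded so the sketch is reproducible
walk = random_walk(TOY_ONTOLOGY, "I10", 3, rng)
pairs = relation_pairs(walk)
```

In this sketch the same walk would supply both the material handed to the LLM for reformulation and the labeled code pairs, which is one way to read the requirement that both objectives be trained on the same data.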

Details

Paper ID
lrec2026-ws-kgllm-02
Pages
pp. 11-19
BibKey
touchent-etal-2026-ontobook
Editors
Gilles Sérasset, Katerina Gkirtzou, Michael Cochez, Jan-Christoph Kalo
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Rian Touchent

  • Éric de la Clergerie
