A Clinical SKOS Ontology and Evaluation Benchmark for LLM Query Generation over ICU Knowledge Graphs

Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26

Abstract

When clinicians query databases using everyday language, such as "code blue patients" or "sugar disease", Large Language Models must bridge a lexical gap between colloquial speech and formal clinical terminology. While highly capable cloud models can leverage external ontologies like SKOS to resolve these terms via SPARQL queries, hospital privacy regulations often mandate the use of air-gapped local LLMs (4–8B parameters). We evaluate query generation across scales (Gemini 2.0 Flash vs. LLaMA 3.1 8B) using ClinSKOS-ICU, a curated ontology of 421 ICU concepts, and ClinNLU, an evaluation benchmark. We identify a critical "Privacy Penalty": while Gemini achieves 90.2% ontology deferral under an RDF+SKOS architecture, local LLMs exhibit a 100% "Semantic Bypass" vulnerability, hardcoding formal terms into queries rather than deferring to the graph. To improve local LLM grounding, we introduce Architectural Decomposition, a pipeline that restricts the LLM to Grammar-Constrained JSON entity extraction and delegates query generation to deterministic code. This structural pivot entirely eliminates Semantic Bypass (0%) and achieves an 80.4% ontology deferral rate on an 8B model, suggesting that decoupled extraction is highly effective for enforcing W3C semantic compliance on privacy-preserving local hardware.
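To make the decoupled pipeline concrete, the following is a minimal sketch of the deterministic query-generation step, not the authors' implementation. It assumes the LLM has already emitted JSON of the hypothetical shape `{"entities": [{"surface": "..."}]}`; the `ex:` namespace, the `ex:hasCondition` predicate, and the function names are illustrative assumptions, since the abstract does not specify the graph schema. The point it illustrates is that the colloquial term is passed through unchanged and matched against `skos:prefLabel` or `skos:altLabel` in the graph, so no formal clinical term is ever hardcoded into the query.

```python
import json

SKOS = "http://www.w3.org/2004/02/skos/core#"
# Hypothetical namespace and predicate: the real graph schema is not given in the abstract.
EX = "http://example.org/icu#"


def escape_literal(text: str) -> str:
    """Escape characters that would break a double-quoted SPARQL string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def build_query(extraction_json: str) -> str:
    """Build a SPARQL query from the LLM's JSON extraction.

    The user's surface terms are only ever compared against SKOS labels,
    so term resolution is deferred to the ontology rather than to the model.
    """
    payload = json.loads(extraction_json)
    terms = [e["surface"].strip().lower() for e in payload["entities"]]
    if not terms:
        raise ValueError("no entities extracted")
    values = " ".join(f'"{escape_literal(t)}"' for t in terms)
    return (
        f"PREFIX skos: <{SKOS}>\n"
        f"PREFIX ex: <{EX}>\n"
        "SELECT DISTINCT ?patient WHERE {\n"
        f"  VALUES ?term {{ {values} }}\n"
        "  ?concept skos:prefLabel|skos:altLabel ?label .\n"
        "  FILTER(LCASE(STR(?label)) = ?term)\n"
        "  ?patient ex:hasCondition ?concept .\n"
        "}"
    )


if __name__ == "__main__":
    # Example extraction for a query such as "show me the sugar disease patients".
    print(build_query('{"entities": [{"surface": "Sugar Disease"}]}'))
```

Because the query template is fixed in code, the model's only job is extracting the surface term, which is the property the abstract attributes to the elimination of Semantic Bypass.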