Historical Medical Knowledge Graphs and Ontologies from the Medical History of British India Corpus (1850-1950)
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This research presents a reproducible framework for constructing biomedical knowledge graphs and ontologies from digitized historical archives. Focusing on the Medical History of British India corpus (468 reports; ∼22.5M words; 1850–1950), our pipeline combines BioBERT-based entity recognition, LLM-guided relation extraction with LLM-based filtering, and clustering-based ontology induction. Reliability is strengthened through canonicalization, schema mapping to standardized biomedical relation types, and multi-metric edge scoring with temporal decay; a manual evaluation of 500 validated triples yields a precision of 0.892. The resulting resources comprise 282,882 extracted relations, consolidated into 22,360 unique surface forms and organized into 71 thematic clusters. Frequent relation categories include After Treatment (∼1,242 mentions), Date of Inoculation (∼540), and diverse causal relations. The induced ontology highlights six epidemic diseases (plague, cholera, malaria, kala azar, leprosy, and smallpox) together with their characteristic interventions (e.g., quinine therapy, vaccination campaigns, hospital disinfection). Temporal analyses capture historically plausible trajectories: plague interventions peak in the 1890s, cholera shows a long-run decline, and tuberculosis departments rise after 1910. All code, relation inventories, ontologies, and visualizations are released in a GitHub repository, enabling reproducibility and supporting research in historical NLP, biomedical informatics, and digital humanities.
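To make the idea of multi-metric edge scoring with temporal decay more concrete, the following is a minimal illustrative sketch, not the scoring function used in this work. It assumes two hypothetical metrics (an extractor confidence and a report-level support count), hypothetical weights, and an exponential decay with a hypothetical half-life measured back from a reference year; the abstract specifies none of these, and the direction of the decay is likewise an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """A candidate knowledge-graph edge. All fields are illustrative assumptions."""
    head: str
    relation: str
    tail: str
    confidence: float  # hypothetical extractor confidence in [0, 1]
    support: int       # hypothetical number of reports containing the triple
    year: int          # publication year of the source report


def edge_score(edge: Edge,
               reference_year: int = 1950,
               half_life: float = 25.0,
               w_conf: float = 0.6,
               w_support: float = 0.4,
               max_support: int = 50) -> float:
    """Combine two metrics linearly, then apply exponential temporal decay.

    The weights, half-life, and support cap are hypothetical defaults
    chosen only for illustration.
    """
    # Log-scaled support, capped at 1.0 so very frequent triples do not dominate.
    support_term = min(math.log1p(edge.support) / math.log1p(max_support), 1.0)
    base = w_conf * edge.confidence + w_support * support_term

    # The score halves for every `half_life` years before the reference year.
    age = max(reference_year - edge.year, 0)
    decay = 0.5 ** (age / half_life)
    return base * decay


if __name__ == "__main__":
    # Example values are invented for demonstration, not drawn from the corpus.
    e = Edge("quinine", "treats", "malaria", confidence=0.9, support=12, year=1900)
    print(round(edge_score(e), 3))
```

Under these assumptions, an edge with full confidence and maximal support keeps its full score in the reference year and half of it one half-life earlier; edges could then be ranked or thresholded by this value.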