Ontology-Guided Synthetic Data Generation for Low-Resource Information Extraction: A Case Study in IT Heritage Domain

Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26

Abstract

Information Extraction (IE) in specialized domains often suffers from a severe cold-start problem due to the high cost of expert annotation. Recent Reverse-IE approaches leverage knowledge graphs to generate synthetic training corpora, but they typically assume that a knowledge base already exists. In this work, we propose an ontology-driven pipeline for synthetic supervision that removes this requirement. Starting from a formal domain ontology, we introduce a stochastic motif sampling strategy that constructs schema-consistent knowledge graph structures with controllable topology, which are then verbalized into natural language. This ontology-first formulation also allows direct control over the data generation process, enabling oversampling of underrepresented entity types or relation patterns. Applied to the IT Heritage domain, our approach produces a fully labeled corpus for named entity recognition (NER) and relation extraction (RE) without large-scale manual annotation. Evaluation in a low-resource setting shows that, although the synthetic corpus lacks the linguistic diversity of gold data, it scales to training sets large enough to alleviate the cold-start problem, making ontology-guided motif generation a practical strategy for domains where gold annotation is limited.