Stack2Graph: A Structured Knowledge Representation of Stack Overflow Data for Retrieval-based Question Answering
Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26
Abstract
Community-based platforms like Stack Overflow (SO) offer a vast and diverse source of software development knowledge, combining natural language data with code snippets. Resources built from SO have been widely used to support downstream tasks in software engineering and natural language processing. However, no existing resource fully reconstructs and connects the complete range of information available on SO while leveraging the platform’s structure. We introduce Stack2Graph, a large-scale resource that preserves the forum’s structural relationships in a semantically explicit form by combining a knowledge graph with a vector database. This hybrid design captures the intrinsic links between questions, answers, comments, tags, and cross-references, bridging symbolic and vector-based representations to enable structured and multi-hop retrieval. The goal is to make SO knowledge more efficiently accessible to LLM-based systems and easier to integrate into downstream applications. To evaluate its impact, we integrate Stack2Graph into a zero-shot pipeline for multiple-choice question answering on CodeMMLU. Results show that retrieval augmentation particularly benefits mid-sized general-purpose models, with substantial gains on API- and framework-oriented tasks.
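To illustrate the hybrid idea described in the abstract, the following is a minimal sketch, not the Stack2Graph implementation. It assumes a toy in-memory graph and a bag-of-words similarity in place of a real graph store and a learned embedding model. All node ids, relation names (`HAS_ANSWER`, `HAS_COMMENT`, `TAGGED`, `LINKS_TO`), and the `retrieve` function are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: node id -> (kind, text). Not taken from Stack2Graph.
NODES = {
    "q1": ("question", "how to read a csv file with pandas"),
    "a1": ("answer", "use pandas read_csv and pass the file path"),
    "c1": ("comment", "set the sep argument for tab separated files"),
    "q2": ("question", "how to reverse a list in python"),
    "a2": ("answer", "call reversed or use slicing with a negative step"),
    "t_pandas": ("tag", "pandas"),
}

# Hypothetical typed edges: (source, relation, target).
EDGES = [
    ("q1", "HAS_ANSWER", "a1"),
    ("a1", "HAS_COMMENT", "c1"),
    ("q1", "TAGGED", "t_pandas"),
    ("q2", "HAS_ANSWER", "a2"),
    ("q2", "LINKS_TO", "q1"),  # cross-reference between questions
]


def embed(text):
    """Toy bag-of-words vector; a real system would use a learned embedding."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def build_adjacency(edges):
    """Map each source node to its outgoing (relation, target) pairs."""
    adj = defaultdict(list)
    for src, rel, dst in edges:
        adj[src].append((rel, dst))
    return adj


def retrieve(query, k=1, hops=2):
    """Vector step selects the top-k seed questions; the graph step then
    follows outgoing edges for up to `hops` steps and returns every node
    reached, in order of discovery."""
    qvec = embed(query)
    questions = [n for n, (kind, _) in NODES.items() if kind == "question"]
    seeds = sorted(
        questions, key=lambda n: cosine(qvec, embed(NODES[n][1])), reverse=True
    )[:k]

    adj = build_adjacency(EDGES)
    seen, frontier = list(seeds), list(seeds)
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for _rel, dst in adj[node]:
                if dst not in seen:
                    seen.append(dst)
                    next_frontier.append(dst)
        frontier = next_frontier
    return seen
```

In this sketch, calling `retrieve("reverse list python")` first matches the question `q2` by similarity, then reaches its answer and, through the cross-reference edge, the linked question `q1` together with that question's answer and tag. This is the kind of multi-hop context a purely vector-based lookup would not return on its own.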