Towards Knowledge Graph-Grounded Evaluation of Agentic LLMs on Cybersecurity Capture-the-Flag Challenges

Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26

Abstract

Evaluating Large Language Model (LLM) agents on complex multi-step cybersecurity tasks requires structured, reproducible evaluation rubrics. We present BraceGreen, a framework that formalizes Capture-the-Flag (CTF) attack paths as knowledge graphs and uses them as gold-standard rubrics for agentic LLM evaluation. Each node in our knowledge graphs represents an attack step annotated with MITRE ATT&CK tactics, goals, commands, expected outputs, and semantic outcomes, while edges encode prerequisites, dependencies, and alternative paths. Our LangGraph-based evaluation workflow employs an LLM-as-judge with chain-of-thought reasoning to semantically compare agent predictions against the alternatives encoded in the knowledge graph. We contribute a benchmark of seven CTF machines with knowledge graph annotations, three evaluation modes (command prediction, goal inference, and anticipated result), and integration with live machine infrastructure via virtual machines and an MCP server. Our approach bridges the gap between unstructured CTF writeups and graph-structured evaluation rubrics.
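To make the graph structure concrete, the following is a minimal Python sketch of how an attack path with annotated steps, prerequisite edges, and alternative-path edges might be represented. It is an illustration under stated assumptions, not the BraceGreen implementation: the class names (AttackStep, AttackGraph), the field names, the two edge types, the target address 10.0.0.5, and the example commands and outputs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttackStep:
    """One node: an attack step carrying the annotations named in the abstract."""
    step_id: str
    tactic: str           # MITRE ATT&CK tactic name
    goal: str
    command: str
    expected_output: str
    outcome: str          # semantic outcome of the step


@dataclass
class AttackGraph:
    """Hypothetical container: nodes plus prerequisite and alternative edges."""
    steps: dict = field(default_factory=dict)
    requires: dict = field(default_factory=dict)      # step_id -> prerequisite ids
    alternatives: dict = field(default_factory=dict)  # step_id -> interchangeable ids

    def add_step(self, step, requires=()):
        self.steps[step.step_id] = step
        self.requires[step.step_id] = set(requires)
        self.alternatives.setdefault(step.step_id, set())

    def add_alternative(self, a, b):
        # Alternative edges are symmetric: either step reaches the same outcome.
        self.alternatives.setdefault(a, set()).add(b)
        self.alternatives.setdefault(b, set()).add(a)

    def is_done(self, step_id, completed):
        # A step counts as done if it, or any of its alternatives, was completed.
        alts = self.alternatives.get(step_id, set())
        return step_id in completed or bool(alts & completed)

    def next_steps(self, completed):
        """Return the steps not yet done whose prerequisites are all satisfied."""
        return sorted(
            sid for sid in self.steps
            if not self.is_done(sid, completed)
            and all(self.is_done(p, completed) for p in self.requires[sid])
        )


# Illustrative attack path; every value below is made up for the example.
graph = AttackGraph()
graph.add_step(AttackStep(
    "scan", "Reconnaissance", "Find open ports",
    "nmap -sV 10.0.0.5", "80/tcp open http", "Web server discovered"))
graph.add_step(AttackStep(
    "dirbust", "Reconnaissance", "Enumerate web paths",
    "gobuster dir -u http://10.0.0.5 -w words.txt",
    "/admin (Status: 200)", "Admin panel found"), requires=["scan"])
graph.add_step(AttackStep(
    "nikto", "Reconnaissance", "Enumerate web paths",
    "nikto -h http://10.0.0.5", "/admin found", "Admin panel found"),
    requires=["scan"])
graph.add_alternative("dirbust", "nikto")
graph.add_step(AttackStep(
    "login", "Initial Access", "Log in to the admin panel",
    "curl -d 'user=admin&pass=admin' http://10.0.0.5/admin/login",
    "HTTP 302 redirect to dashboard", "Authenticated session obtained"),
    requires=["dirbust"])

print(graph.next_steps(set()))              # ['scan']
print(graph.next_steps({"scan"}))           # ['dirbust', 'nikto']
print(graph.next_steps({"scan", "nikto"}))  # ['login']
```

In this sketch, completing either of two alternative steps satisfies any prerequisite that names the other. That is one plausible way for a rubric to accept more than one valid route through the same attack path. The semantic comparison of agent predictions against such nodes (the LLM-as-judge step) is deliberately left out.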