
Towards Knowledge Graph-Grounded Evaluation of Agentic LLMs on Cybersecurity Capture-the-Flag Challenges

Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26

DOI:10.63317/4ne74wm45iti

Abstract

Evaluating Large Language Model (LLM) agents on complex, multi-step cybersecurity tasks requires structured, reproducible evaluation rubrics. We present BraceGreen, a framework that formalizes Capture-the-Flag (CTF) attack paths as knowledge graphs and uses them as gold-standard rubrics for agentic LLM evaluation. Each node in our knowledge graphs represents an attack step annotated with MITRE ATT&CK tactics, goals, commands, expected outputs, and semantic outcomes, while edges encode prerequisites, dependencies, and alternative paths. Our LangGraph-based evaluation workflow employs an LLM-as-judge with chain-of-thought reasoning to semantically compare agent predictions against the alternatives encoded in the knowledge graph. We contribute a benchmark of seven CTF machines with knowledge graph annotations, three evaluation modes (command prediction, goal inference, and anticipated result), and integration with live machine infrastructure via virtual machines and an MCP server. Our approach bridges the gap between unstructured CTF writeups and graph-structured evaluation rubrics.
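
To make the graph structure in the abstract concrete, here is a minimal Python sketch of how an attack path could be stored and queried. It is not the BraceGreen implementation: the class names, field names, step identifiers, and the toy machine are all hypothetical, and the LLM-as-judge comparison is left out entirely. The sketch only shows nodes carrying the listed annotations, prerequisite and alternative edges, and a lookup of the gold steps that are valid next moves from a given state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AttackStep:
    """One node: an attack step with its annotations (field names are assumed)."""
    step_id: str
    tactic: str           # MITRE ATT&CK tactic name
    goal: str
    command: str
    expected_output: str
    outcome: str          # semantic outcome of the step


@dataclass
class AttackGraph:
    """Hypothetical container for steps plus prerequisite and alternative edges."""
    steps: dict = field(default_factory=dict)         # step_id -> AttackStep
    requires: dict = field(default_factory=dict)      # step_id -> prerequisite ids
    alternatives: dict = field(default_factory=dict)  # step_id -> alternative ids

    def add_step(self, step, requires=()):
        self.steps[step.step_id] = step
        self.requires[step.step_id] = set(requires)
        self.alternatives.setdefault(step.step_id, set())

    def add_alternative(self, a, b):
        # Alternative edges are symmetric: either step reaches the same outcome.
        self.alternatives[a].add(b)
        self.alternatives[b].add(a)

    def candidates(self, completed):
        """Return ids of steps that are valid next moves after `completed`."""
        done = set(completed)
        covered = set(done)
        for sid in done:
            # Finishing a step also covers every alternative to it.
            covered |= self.alternatives.get(sid, set())
        return sorted(
            sid for sid, reqs in self.requires.items()
            if sid not in covered and reqs <= covered
        )


# Toy machine (invented for illustration, not one of the benchmark machines).
g = AttackGraph()
g.add_step(AttackStep("scan", "Reconnaissance", "Find open services",
                      "nmap -sV <target>", "list of open ports",
                      "services identified"))
g.add_step(AttackStep("web_enum", "Discovery", "Find hidden web paths",
                      "gobuster dir -u http://<target> -w <wordlist>",
                      "list of discovered paths", "entry point identified"),
           requires=["scan"])
g.add_step(AttackStep("ftp_anon", "Discovery", "Check anonymous FTP access",
                      "ftp <target>", "successful anonymous login",
                      "entry point identified"),
           requires=["scan"])
g.add_step(AttackStep("foothold", "Initial Access", "Obtain a shell",
                      "<exploit command>", "interactive shell",
                      "user-level access"),
           requires=["web_enum"])
g.add_alternative("web_enum", "ftp_anon")

print(g.candidates([]))                    # ['scan']
print(g.candidates(["scan"]))              # ['ftp_anon', 'web_enum']
print(g.candidates(["scan", "ftp_anon"]))  # ['foothold']
```

In the last call, completing `ftp_anon` satisfies the prerequisite on `web_enum` because the two steps are linked by an alternative edge. An evaluator along the lines described in the abstract would then compare an agent's predicted command, goal, or anticipated result against the annotations of the returned candidate steps.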

Details

Paper ID: lrec2026-ws-kgllm-15
Pages: pp. 144-154
BibKey: schlr-etal-2026-knowledge
Editors: Gilles Sérasset, Katerina Gkirtzou, Michael Cochez, Jan-Christoph Kalo
Publisher: European Language Resources Association (ELRA)
ISSN: N/A
ISBN: N/A
Workshop: Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26
Location: Palma, Mallorca, Spain
Date: 11 - 16 May 2026

Authors

  • Daniel Schlör
  • Marius Bohn
  • Maximilian Wolf
  • Kevin Bergner
  • Christian Goldschmied
  • Andreas Hotho

Links