Improving Text2Cypher with Confidence-Based Test-Time Strategies

Proceedings of the Knowledge Graphs and Large Language Models Workshop (KG-LLM) @ LREC26

Abstract

Advances in Large Language Models (LLMs) have made it possible to convert natural language questions into executable database queries. Text2Cypher targets graph databases: it converts user questions into Cypher queries, providing natural language access to graph-structured data. While significant progress has been made through prompt design, fine-tuning, and iterative refinement, adaptive test-time strategies that combine multiple generated outputs have received less attention. In this work, we investigate the impact of confidence-based test-time strategies on the Text2Cypher task by evaluating the model’s traces, that is, the sequences of tokens generated while constructing the query. We show that reasoning models generate diverse query candidates but frequently produce syntactic errors and incomplete structures, which limits executability. Instruction-tuned models, in contrast, yield more reliable outputs but lack sufficient diversity for effective confidence-based selection. By tuning diversity parameters such as top-p and temperature, we observe consistent improvements in both query accuracy and execution success. Experiments across multiple instruction-tuned models confirm that combining diversity-controlled generation with confidence-aware inference provides a practical, model-agnostic method for improving query generation.
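To make the idea of confidence-based selection over multiple traces concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes that each sampled candidate comes with its per-token log-probabilities, that trace confidence is the mean token log-probability, and that candidates are combined by a confidence-weighted vote. The function names (`trace_confidence`, `normalize`, `select_query`) and the example queries are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def trace_confidence(token_logprobs):
    """Mean token log-probability of one trace (an assumed confidence measure)."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def normalize(query):
    """Collapse whitespace so trivially different strings vote together."""
    return " ".join(query.split())


def select_query(candidates):
    """Confidence-weighted vote over (query, token_logprobs) pairs.

    Returns the normalized query with the highest summed weight,
    or None if no candidate has a usable trace.
    """
    votes = defaultdict(float)
    for query, token_logprobs in candidates:
        conf = trace_confidence(token_logprobs)
        if conf == float("-inf"):
            continue  # skip candidates without token-level information
        # exp(mean log-prob) maps confidence into (0, 1] as a vote weight
        votes[normalize(query)] += math.exp(conf)
    if not votes:
        return None
    return max(votes, key=votes.get)


# Hypothetical candidates, as if sampled with a nonzero temperature and top-p
candidates = [
    ("MATCH (p:Person) RETURN p.name", [-0.1, -0.2, -0.1]),
    ("MATCH (p:Person)  RETURN p.name", [-0.3, -0.2, -0.4]),
    ("MATCH (p:Person) RETURN p", [-0.05, -0.05, -0.05]),
]
best = select_query(candidates)
print(best)
```

In this sketch, two moderately confident traces that agree on the same query can outweigh a single, more confident outlier, which is why sufficient diversity among the sampled candidates matters for this kind of selection.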