GeoBenchmark: Probing Large Language Models for Geo-Spatial Knowledge

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

Large Language Models (LLMs) demonstrate strong factual recall of general-purpose knowledge but struggle with grounded geospatial knowledge. To probe and measure LLMs’ spatial knowledge, we present GeoBenchmark, a benchmark for evaluating geographic commonsense along three core spatial relations: direction, distance, and topology. Using data extracted from YAGO2geo and Ordnance Survey ward geometries, we formalize spatial relations as structured triplets and systematically transform them into balanced binary (Yes/No) and multiple-choice (MCQ) question-answer pairs. We further distinguish atomic from composite questions according to the number of spatial relations involved. The resulting dataset comprises 26k binary and 13k MCQ samples, uniformly distributed across atomic, binary, and ternary relation levels (one, two, and three relations per question, respectively). We establish baselines with LLaMA-8B and Mistral-7B under zero-shot prompting. Both models achieve 52-63% accuracy on atomic questions but fall below 35% on ternary relations, exposing their limited compositional spatial understanding and strong option bias. GeoBenchmark provides a comprehensive, reproducible resource for probing and advancing LLMs’ geographic commonsense, and it supports future research on spatial and geographic probing of LLMs as well as knowledge editing.
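To make the triplet-to-question transformation concrete, the following is a minimal sketch of how a single direction triplet could be turned into a balanced Yes/No pair and a four-option MCQ. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the `Triplet` class, the `DIRECTIONS` label set, the question templates, and the example place names are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical label set; the benchmark's real direction vocabulary may differ.
DIRECTIONS = ["north", "south", "east", "west"]


@dataclass(frozen=True)
class Triplet:
    """A spatial fact of the form (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


def to_binary_pairs(t: Triplet, rng: random.Random) -> list[tuple[str, str]]:
    """Return one Yes and one No question, so the two labels stay balanced."""
    # The negative question swaps in a randomly chosen incorrect direction.
    wrong = rng.choice([d for d in DIRECTIONS if d != t.relation])
    template = "Is {s} {r} of {o}?"
    return [
        (template.format(s=t.subject, r=t.relation, o=t.obj), "Yes"),
        (template.format(s=t.subject, r=wrong, o=t.obj), "No"),
    ]


def to_mcq(t: Triplet, rng: random.Random) -> tuple[str, dict[str, str], str]:
    """Return (question, lettered options, correct letter) with shuffled options."""
    options = DIRECTIONS[:]
    rng.shuffle(options)  # vary the position of the correct option
    letters = "ABCD"
    answer = letters[options.index(t.relation)]
    question = f"In which direction is {t.subject} from {t.obj}?"
    return question, dict(zip(letters, options)), answer


if __name__ == "__main__":
    rng = random.Random(0)
    fact = Triplet("Leeds", "north", "London")  # illustrative, not from the dataset
    for question, label in to_binary_pairs(fact, rng):
        print(question, label)
    print(to_mcq(fact, rng))
```

Shuffling the option order in the MCQ is one simple way to keep the correct answer from sitting at a fixed position, which matters when a model shows option bias. Composite (two- and three-relation) questions would need an additional step that chains or conjoins several triplets, which this sketch does not attempt.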