MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for reliable deployment. We introduce MazeEval, a benchmark designed to evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks without visual input. Through a function-calling interface, models navigate mazes of varying complexity (5×5 to 15×15 grids) using only coordinate feedback and distance-to-wall information. We evaluate eight state-of-the-art LLMs on identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI’s o3 achieves perfect navigation on mazes up to 30×30, the other models exhibit catastrophic failure beyond 9×9 mazes, with 100% of failures attributable to excessive looping behavior. We document significant performance degradation in Icelandic, where models solve mazes 3-4 grid sizes smaller than in English, suggesting that spatial reasoning emerges from language-specific patterns rather than language-agnostic mechanisms. These results highlight that spatial intelligence remains fundamentally constrained by training data availability, with important implications for the global deployment of LLM-powered autonomous systems.
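To make the evaluation setup concrete, the sketch below illustrates one way such a coordinate-feedback, function-calling interface could be structured. It is a minimal illustration under stated assumptions: the names (Maze, Observation, move) and the grid encoding are hypothetical and not the benchmark's actual implementation.

```python
# Illustrative sketch of a coordinate-feedback maze interface (hypothetical;
# not the MazeEval implementation). The model would call move(direction) and
# receive only its coordinates plus distance-to-wall information.
from dataclasses import dataclass

@dataclass
class Observation:
    x: int        # current column, 0-indexed
    y: int        # current row, 0-indexed
    walls: dict   # distance (in open cells) to the nearest wall, per direction

DIRECTIONS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}

class Maze:
    def __init__(self, grid, start):
        self.grid = grid          # list of strings: '#' = wall, '.' = open cell
        self.x, self.y = start

    def _distance_to_wall(self, dx, dy):
        """Count open cells from the current position until a wall is reached."""
        steps, x, y = 0, self.x + dx, self.y + dy
        while 0 <= y < len(self.grid) and 0 <= x < len(self.grid[0]) and self.grid[y][x] == '.':
            steps += 1
            x, y = x + dx, y + dy
        return steps

    def observe(self):
        return Observation(self.x, self.y,
                           {d: self._distance_to_wall(dx, dy)
                            for d, (dx, dy) in DIRECTIONS.items()})

    def move(self, direction):
        """Tool exposed to the model: step one cell if open, then return feedback."""
        dx, dy = DIRECTIONS[direction]
        nx, ny = self.x + dx, self.y + dy
        if self.grid[ny][nx] == '.':
            self.x, self.y = nx, ny
        return self.observe()

# Example: a 5×5 interior surrounded by walls.
maze = Maze(["#######",
             "#.....#",
             "#.###.#",
             "#...#.#",
             "#.#.#.#",
             "#.....#",
             "#######"], start=(1, 1))
print(maze.move("south"))   # Observation with new coordinates and wall distances
```

Under this kind of interface, the model never sees the maze layout; it must build and maintain a spatial map purely from the sequence of coordinate and distance observations, which is the ability the benchmark is designed to isolate.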