
MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI: 10.63317/4nm93hckcaf2

Abstract

As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for reliable deployment. We introduce MazeEval, a benchmark designed to evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks without visual input. Using a function-calling interface, models navigate mazes of varying complexity (5 × 5 to 15 × 15 grids) using only coordinate feedback and distance-to-wall information. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI's O3 achieves perfect navigation up to 30 × 30 mazes, other models exhibit catastrophic failure beyond 9 × 9 mazes, with 100% of failures attributed to excessive looping behavior. We document significant performance degradation in Icelandic, with models solving mazes 3–4 sizes smaller than in English, suggesting spatial reasoning emerges from linguistic patterns rather than language-agnostic mechanisms. These results highlight that spatial intelligence remains fundamentally constrained by training data availability, with important implications for global deployment of LLM-powered autonomous systems.
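To make the interaction model concrete, the following is a minimal sketch of what a coordinate-feedback maze environment of this kind could look like. The paper's actual interface is not reproduced on this page, so the tool name move, the Observation fields, and the grid encoding below are illustrative assumptions based only on the abstract's description (coordinate feedback plus distance-to-wall information, no visual input).

# Illustrative sketch only: `move`, `Observation`, and the '#'/'.' grid
# encoding are assumptions; the benchmark's real interface may differ.
from dataclasses import dataclass

@dataclass
class Observation:
    x: int                         # current column
    y: int                         # current row
    dist_to_wall: dict[str, int]   # open cells until a wall, per direction

class MazeEnv:
    """Grid maze where '#' marks walls and '.' marks open cells."""

    DIRS = {"north": (0, -1), "south": (0, 1), "west": (-1, 0), "east": (1, 0)}

    def __init__(self, grid: list[str], start: tuple[int, int], goal: tuple[int, int]):
        self.grid = grid
        self.x, self.y = start
        self.goal = goal

    def _open(self, x: int, y: int) -> bool:
        # A cell is passable if it is inside the grid and not a wall.
        return (0 <= y < len(self.grid) and 0 <= x < len(self.grid[0])
                and self.grid[y][x] == ".")

    def _distance(self, dx: int, dy: int) -> int:
        # Count open cells between the agent and the nearest wall.
        steps, x, y = 0, self.x + dx, self.y + dy
        while self._open(x, y):
            steps, x, y = steps + 1, x + dx, y + dy
        return steps

    def move(self, direction: str) -> Observation:
        """Tool exposed to the model via function calling: step one cell
        (if possible), then report coordinates and distance-to-wall in
        all four directions."""
        dx, dy = self.DIRS[direction]
        if self._open(self.x + dx, self.y + dy):
            self.x, self.y = self.x + dx, self.y + dy
        return Observation(
            x=self.x, y=self.y,
            dist_to_wall={d: self._distance(*v) for d, v in self.DIRS.items()},
        )

    @property
    def solved(self) -> bool:
        return (self.x, self.y) == self.goal

# Example: a tiny 5 × 5 maze; a benchmark harness would register `move`
# as a callable tool and let the model issue one call per step.
maze = MazeEnv(
    grid=["#####",
          "#...#",
          "#.#.#",
          "#...#",
          "#####"],
    start=(1, 1),
    goal=(3, 3),
)
for step in ["east", "east", "south", "south"]:
    obs = maze.move(step)
print(maze.solved)  # True

Under this reading of the abstract, the looping failures reported for larger mazes would correspond to a model repeatedly issuing move calls that revisit the same coordinates without making progress toward the goal.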

Details

Paper ID
lrec2026-main-027
Pages
pp. 407-418
BibKey
einarsson-2026-mazeeval
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Hafsteinn Einarsson
