
DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/37w7pv6oaaeg

Abstract

While Large Language Models (LLMs) achieve near-human performance on standard benchmarks, their capabilities often fail to generalize to complex, real-world problems. To bridge this gap, we introduce DeepQuestion, a scalable, automated framework that systematically elevates the cognitive complexity of existing datasets through controlled task transformations grounded in explicit cognitive hierarchies. Based on Bloom’s taxonomy, DeepQuestion generates (1) scenario-based problems that test the application of knowledge in noisy, realistic contexts, and (2) instruction-based prompts that require models to create new questions from a given solution path, assessing synthesis and evaluation skills. Our extensive evaluation across ten leading open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveals a stark performance decline across evaluation settings, with accuracy dropping by up to 70% as tasks ascend the cognitive hierarchy. These findings underscore that current benchmarks overestimate true reasoning abilities and highlight the critical need for cognitively diverse evaluations to guide future LLM development.

Details

Paper ID
lrec2026-main-896
Pages
pp. 11451-11460
BibKey
khoramfar-etal-2026-deepquestion
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Ali Khoramfar
  • Ali Ramezani
  • Mohammad Mahdi Mohajeri
  • Mohammad Javad Dousti
  • Majid Nili Ahmadabadi
  • Heshaam Faili
