Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-family evaluation remains limited. We evaluate 39 configurations spanning three model families (Qwen3, Claude Haiku 4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Cross-family differences produce substantially larger performance gaps (2.0–2.2×, F1 = 0.761 vs. 0.343) than parameter scaling within a family (an 83% gain from 4B to 32B), and a partial-correlation analysis rules out tokenizer design as a confound for within-family scaling. Sensitivity to thinking budget is heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade, so additional test-time compute does not yield uniform benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we find modest but consistent calibration (ρ = 0.28–0.42) across all families, yet identify systematic failures on common words with unusual orthography ("data", "loll", "acai": 83–91% human success, 94–98% model miss rate). These failures point to an over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns.
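To make the confound check mentioned above concrete, the following is a minimal sketch, not the paper's implementation, of a first-order partial Spearman correlation between model scale and F1 while controlling for a tokenizer statistic. All variable names and input values are illustrative assumptions, not reported results.

```python
import numpy as np
from scipy.stats import spearmanr

def partial_spearman(x, y, z):
    """First-order partial Spearman correlation of x and y controlling for z,
    via the closed-form identity on the three pairwise rank correlations:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))."""
    r_xy, _ = spearmanr(x, y)
    r_xz, _ = spearmanr(x, z)
    r_yz, _ = spearmanr(y, z)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Illustrative placeholders: per-configuration log parameter counts, puzzle
# F1 scores, and a hypothetical tokenizer fertility metric for each model.
log_params = np.log(np.array([4e9, 8e9, 14e9, 32e9]))
f1_scores = np.array([0.40, 0.50, 0.60, 0.73])
fertility = np.array([1.9, 1.8, 1.8, 1.7])

# A partial correlation near the raw scale-F1 correlation would suggest the
# scaling effect is not explained by the tokenizer covariate.
print(partial_spearman(log_params, f1_scores, fertility))
```

Under this setup, a within-family scaling effect that survives controlling for the tokenizer covariate is what the abstract describes as ruling out tokenizer design as a confound.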