Large Language Model-Based Post-OCR Correction for Low-Resource Kazakh Scripts

Proceedings of the Third Workshop on Computation and Written Language (CAWL 2026) @ LREC 2026

Abstract

Kazakh is written in three scripts (Arabic, Cyrillic, and Latin), which poses distinct challenges for OCR and post-OCR correction. Despite this complexity, NLP research on Kazakh and its low-resource scripts remains extremely scarce. We analyze common OCR error patterns in all three Kazakh scripts using Tesseract and evaluate four large language models (LLMs) for post-OCR correction using minimal, confusion-aware, and few-shot prompting strategies. Our results reveal three systematic, writing-system-driven failure modes in LLM-based post-OCR correction: script switching, hallucination, and instruction-following breakdown. Arabic-script post-OCR correction remains unsuccessful across all setups. For the Cyrillic script, improvements are minimal because baseline OCR performance is already high. For the Latin script, few-shot prompting with Gemini 2.5 Flash yields substantial improvements, reducing character error rate (CER) by 8.58 points and word error rate (WER) by 32.49 points, to levels better than those of high-resource Kazakh Cyrillic script OCR. These findings demonstrate that LLM post-OCR correction failure modes are predictable from writing system properties such as script resource asymmetry and co-existing script dominance, and they underscore the need for typology-aware evaluation frameworks for multi-script and under-resourced languages.
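For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between the OCR (or corrected) output and the reference, divided by the reference length in characters or words. The function names and the example strings are illustrative assumptions, not taken from the paper, and the paper's exact normalization and tokenization may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace tokens)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: a single l -> I confusion in Latin-script text.
reference = "qazaq tili"
ocr_output = "qazaq tiIi"
print(f"CER = {cer(reference, ocr_output):.2%}")
print(f"WER = {wer(reference, ocr_output):.2%}")
```

In this made-up example, one wrong character out of ten corrupts one of two words, which illustrates why WER is typically much higher than CER and why a modest CER reduction can coincide with a much larger WER reduction.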