Latent Narratives at AR-MS NakbaNLP 2026: Reducing Character Errors in Arabic Manuscript Transcription: A CER-Oriented System
Proceedings of the 2nd International Workshop on Nakba Narratives as Language Resources @ LREC 2026
Abstract
Historical Arabic handwritten texts present significant challenges due to varied handwriting styles, cursive structure, diverse diacritics, and inconsistent character and word sizes. In this work, we introduce Historic-Arabic-OCR, a vision-language OCR system built upon Qari-OCR, which itself is based on Qwen2-VL-2B-Instruct, and further fine-tuned using Low-Rank Adaptation (LoRA) for Arabic manuscript transcription. The proposed approach incorporates contrast enhancement using Contrast Limited Adaptive Histogram Equalization (CLAHE) and deterministic decoding strategies to reduce character-level errors. Our model achieves competitive performance, with a Word Error Rate (WER) of 0.28 and a Character Error Rate (CER) of 0.10 on historical Arabic texts, including low-resolution images. The final submitted system uses CLAHE preprocessing with deterministic greedy decoding to minimize character-level errors.
Keywords: Arabic OCR, Vision-Language Models, Qwen2-VL, LoRA, CER Optimization
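To make the reported metrics concrete, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length in characters or words. This assumes the standard definitions. The source does not state the exact scoring procedure or any text normalization (for example, diacritic handling), and the function names `edit_distance`, `cer`, and `wer` are illustrative, not taken from the authors' code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Assumed toy example: one missing letter in a four-letter word.
    print(cer("كتاب", "كتب"))
    print(wer("a b c d", "a x c d"))
```

Under these definitions, a CER of 0.10 corresponds to roughly one character edit per ten reference characters. Because a single wrong character makes the whole word count as an error, WER is usually higher than CER, as it is in the reported 0.28 and 0.10.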