Baflah-lamri at NAKBA-NLP 2026: Manual Ground Truth Enrichment
Proceedings of the 2nd International Workshop on Nakba Narratives as Language Resources @ LREC 2026
Abstract
This paper describes our team's methodology for Subtask 1 (Transcription Track) of the NAKBA NLP 2026 Shared Task for Arabic Manuscript Understanding. We present a rigorous approach to line-level manual transcription of historical Arabic manuscripts drawn from the Omar Al-Saleh memoir collection (1951-1965). Our methodology emphasizes accuracy, consistency, and adherence to diplomatic transcription principles, while addressing the palaeographic and physical challenges of Arabic handwriting, such as writing speed, orthographic variation, and the effects of writing tools (e.g., immediate strike-throughs and ink spatter). The work was guided by strict transcription guidelines and contextual verification protocols, in which cropped line images were checked against the corresponding full-page images to resolve ambiguities and automated cropping issues. The team transcribed the entire assigned batch of 500 lines (100% completion rate) across 368 unique pages, producing reference data comprising 6,719 words and 37,646 characters. This effort contributes highly reliable Ground Truth data, an essential foundation for training and evaluating Handwritten Text Recognition (HTR) models for Arabic manuscripts.