A Dataset of Wolof Ajami Manuscripts for HTR and OCR

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4pz98ojeeqpw

Abstract

We present the first dataset of manually segmented and transcribed Ajami manuscripts written in Wolof. The term Ajami refers to modified Arabic-script orthographies used to write African languages. Handwritten text recognition (HTR) and optical character recognition (OCR) models for Arabic-script languages perform poorly on African languages written in Ajami orthographies because these languages are not represented in the models' pre-training data. As a result, recognition models fail to extract the unique Arabic-script letters and ubiquitous diacritics used in African languages, and they struggle to adapt to the varied calligraphic styles used across Africa. We release the following as an open-source dataset: an ALTO formatting of high-quality images of handwritten and printed 20th-century Wolof manuscripts; manual segmentation (region and line); and manual transcriptions. We extend our contribution by evaluating several Arabic-script recognition models intended for historical manuscripts and find that they produce character error rates (CER) of 61–81%. We also release the transcriptions produced by the evaluated recognition models, as well as a keyboard for transcribing Wolof Ajami manuscripts. The digitally transcribed text in the dataset can also be used for various natural language processing (NLP) and historical linguistic tasks.
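For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between a model's output and the reference transcription, divided by the length of the reference. The sketch below illustrates that conventional definition only; it is not the authors' evaluation code, and the function names `edit_distance` and `cer` are hypothetical. It also assumes no Unicode normalization or diacritic handling, which the actual evaluation may apply.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against a reference transcription."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer("abcd", "abxd"))
```

Python strings are sequences of Unicode code points, so a combining diacritic counts as its own character here. For heavily vocalized Ajami text, that choice can noticeably affect the score, which is why the normalization applied before scoring matters.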

Details

Paper ID
lrec2026-main-253
Pages
pp. 3234-3239
BibKey
yousuf-etal-2026-dataset
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Oreen Yousuf

  • Elhadji Djibril Diagne

  • Christian Høgel

  • Beata Megyesi

  • Joakim Nivre

Links