LREC 2026, main conference

Scripting History: A Diachronic Urdu Text and Image Corpus from the 18th to 19th Centuries

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/48tdw3hsxp47

Abstract

This paper presents the Diachronic Urdu Text and Image Corpus, a one-million-word resource covering Urdu’s development across the 18th and 19th centuries. The corpus is compiled from 328 printed books published between 1800 and 1950, representing a diverse range of genres, authors, and publishers. A 140,000-word sub-corpus has been manually annotated with Urdu part-of-speech tags to facilitate linguistic and computational analysis. The dataset enables systematic investigation of historical changes in Urdu orthography, morphology, and syntax, providing new insights into the language’s history and standardization. To preserve the original printed form, each text is paired with its corresponding page image, creating the first multimodal diachronic corpus for Urdu. The paper outlines the corpus compilation pipeline, digitization workflow, text-image alignment, and annotation strategy designed to ensure accuracy, consistency, and authenticity. This multimodal Urdu diachronic corpus establishes a benchmark for research in computational linguistics, digital humanities, and South Asian language technology, supporting corpus-based exploration of Urdu’s linguistic history and cultural heritage.

Details

Paper ID
lrec2026-main-127
Pages
pp. 1622-1632
BibKey
shams-etal-2026-scripting
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Sana Shams
  • Sahar Rauf
  • Asad Mustafa
  • Muhammad Zeeshan Javed
  • Qurat-ul-Ain Akram
  • Sarmad Hussain
  • Miriam Butt

Links