Building a Corpus and Database for Rare and Undeciphered Scripts

Proceedings of the Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026

DOI:10.63317/3w96kx3i86uo

Abstract

Historical sources written in rare or undeciphered scripts represent an immense but underexploited part of the world’s cultural and linguistic heritage. Their study is often hindered by fragmentary preservation, non-standard symbol systems, and the absence of interoperable digital resources. While recent advances in imaging, transcription, and computational analysis have improved access to historical texts, most tools rely on large quantities of labeled data and standardized encodings, requirements that are rarely met for rare or unknown writing systems. This paper presents the design and methodology of a new corpus and database dedicated to rare and undeciphered scripts worldwide. The resource integrates high-quality images, transliterations, transcriptions, linguistic annotations, and metadata within a unified data model tailored for low-resource and non-standard scripts. By adhering to FAIR principles and existing standards for linguistic and cultural heritage data, the database enables reproducible, interdisciplinary research across philology, linguistics, cryptology, and computer science. The paper outlines the data collection and digitization workflow, describes the metadata and database architecture, and demonstrates applications in analysis and decipherment.

Details

Paper ID
lrec2026-ws-lt4hala-17
Pages
pp. 184-196
BibKey
megyesi-etal-2026-building
Editors
Rachele Sprugnoli, Marco Passarotti
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Beata Megyesi
  • Rune Rattenborg
  • Benedek Láng
  • Michelle Waldispühl
  • Mihály Héder
