
Cost-Aware Pre-Annotation Strategies for Nested NER in Historical Latin Notarial Deeds

Proceedings of the Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026

DOI:10.63317/3i3oqfx2dqoe

Abstract

Manual annotation for Named Entity Recognition in historical documents remains expensive and time-consuming, particularly for complex nested entity structures in domain-specific texts such as Latin notarial deeds. Active learning frameworks like the Humanities Entity Recognizer (HER) reduce annotation requirements by iteratively selecting informative samples for expert annotation, but existing sentence-based sampling strategies create unpredictable annotation costs when sentence lengths vary dramatically. We extend the HER to support nested entities through composite BIO label encoding and introduce token-budgeted sample selection to address annotation cost variability. Under token-budgeting, each annotation iteration targets a fixed token budget rather than a fixed sentence count, while Active Curriculum Learning ensures diverse sentence length representation in initial samples. Experiments on seventeenth-century Latin notarial deeds from Malta’s Notarial Registers Archive demonstrate that token-budgeted sampling achieves comparable macro-span F1 to sentence-based sampling while exhibiting more stable learning trajectories across iterations. Additional experiments examining entity-level performance reveal systematic variation by semantic granularity, with higher-level categorical entities achieving stronger recognition than role-based middle-level entities, which depend on discourse context. Our results demonstrate that controlling sample selection at the token level rather than sentence level provides more predictable annotation planning for active learning in historical document corpora with heavy-tailed sentence length distributions.
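As a rough illustration of the two mechanisms named in the abstract, the sketch below shows one way composite BIO labels and token-budgeted selection could look. It is not the authors' implementation. The function names, the `|` separator, the explicit nesting `level` on each span, the entity labels `PARTY` and `PERSON`, and the policy of skipping any sentence that would overshoot the budget are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# (start, end_exclusive, label, nesting_level); a hypothetical span format
Span = Tuple[int, int, str, int]


def composite_bio(n_tokens: int, spans: Sequence[Span], sep: str = "|") -> List[str]:
    """Encode nested spans as one composite BIO tag per token.

    Each nesting level gets its own BIO slot; the slots are joined with `sep`.
    """
    depth = 1 + max((level for *_, level in spans), default=0)
    tags = [["O"] * depth for _ in range(n_tokens)]
    for start, end, label, level in spans:
        tags[start][level] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i][level] = f"I-{label}"
    return [sep.join(slots) for slots in tags]


def select_by_token_budget(
    ranked: Sequence[Tuple[str, List[str]]], budget: int
) -> List[str]:
    """Walk sentences in ranked order and keep those that fit the token budget.

    `ranked` holds (sentence_id, tokens) pairs, most informative first.
    Assumed policy: a sentence that would exceed the budget is skipped so that
    shorter, lower-ranked sentences can still fill the remaining room.
    """
    chosen: List[str] = []
    used = 0
    for sent_id, tokens in ranked:
        if used + len(tokens) > budget:
            continue
        chosen.append(sent_id)
        used += len(tokens)
        if used == budget:
            break
    return chosen


# Hypothetical example: an outer PARTY span containing an inner PERSON span.
labels = composite_bio(3, [(0, 3, "PARTY", 0), (0, 2, "PERSON", 1)])

# Hypothetical ranked pool with sentence lengths 5, 10, and 3 tokens.
pool = [("s1", ["t"] * 5), ("s2", ["t"] * 10), ("s3", ["t"] * 3)]
batch = select_by_token_budget(pool, budget=8)
```

With a fixed sentence count, the batch above could cost anywhere from 8 to 15 tokens depending on which sentences rank highest; with a token budget, the annotation cost per iteration is bounded by construction.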

Details

Paper ID
lrec2026-ws-lt4hala-42
Pages
pp. 407-417
BibKey
ellul-etal-2026-cost
Editors
Rachele Sprugnoli, Marco Passarotti
Publisher
European Language Resources Association (ELRA)
Workshop
Proceedings of the Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Charlene Ellul
  • Vanessa Buhagiar
  • Claudia Borg
  • Charlie Abela
