LREC 2026 Workshop

HeptaTAX: A Neuro-Symbolic Pipeline and Benchmark for Classifying 16th-Century Heptanesian Notarial Acts

Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective

DOI:10.63317/2sh2japrnd73

Abstract

This study originates in an investigation of lexical bundles and formulaic language in sixteenth-century Corfiot notarial documents. The functional variation observed across identical formulaic sequences motivated a document classification framework designed to support the structural interpretation of such language. Because 16th-century Corfiot notarial acts are a rich but understudied dialectal resource, categorizing them systematically into subgenres is essential for their full exploration. This task requires substantial manual work, however, and no NLP tools exist for this task or this dialect. In this paper, we take an initial step in this direction. First, we present a corpus of 1,088 notarial acts from 5 notaries spanning 1500–1567, a 3-tier annotation schema (17 core genres, extension subcategories, and hybrid cross-cutting tags), and a 40-act benchmark with gold annotations at all three tiers. We then evaluate 12 LLMs across 4 architectures: zero-shot, few-shot, full-context, and neuro-symbolic (NeSy). For the latter, we introduce a symbolic engine comprising a set of deterministic rules that identify discriminative legal formulae; its output is then injected into the neural (LLM) engine. The results show that the NeSy architecture compresses the accuracy gap between stronger and weaker models from 47.5 pp to 12.5 pp, with the smallest model (Llama 3.1 8B) gaining 47.5% and matching frontier models that operate without symbolic support. Three models reach a ceiling of 72.5% on the core tier. However, consistent errors in procedurally dense material reveal the limits of lexical and formulaic cues for identifying legal effect, motivating the use of symbolic signals in the NeSy pipeline. Extension and hybrid classification remain open challenges, with best scores of ∼63% and ∼35%, respectively.
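The neuro-symbolic flow described in the abstract (deterministic rules fire on legal formulae, and their output is injected into the LLM input) might look roughly like the sketch below. It is only an illustration under stated assumptions and not the authors' implementation: the rule names, the placeholder patterns `FORMULA_SALE` and `FORMULA_WILL`, the genre labels, and the prompt wording are all hypothetical, and no actual model call is made.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str        # hypothetical rule label
    pattern: str     # placeholder regex, NOT a real Corfiot legal formula
    genre_hint: str  # hypothetical genre label, not from the paper's schema


# Placeholder rules standing in for the deterministic symbolic engine.
RULES = [
    Rule("sale_formula", r"\bFORMULA_SALE\b", "sale"),
    Rule("will_formula", r"\bFORMULA_WILL\b", "will"),
]


def symbolic_signals(text, rules=RULES):
    """Return one signal string per rule whose pattern occurs in the text."""
    return [f"{r.name} -> {r.genre_hint}" for r in rules if re.search(r.pattern, text)]


def build_prompt(text, genres, rules=RULES):
    """Inject the symbolic signals into the prompt that would be sent to an LLM."""
    signals = symbolic_signals(text, rules)
    hint_block = "\n".join(f"- {s}" for s in signals) or "- (no rule fired)"
    return (
        "Classify the notarial act into one of: " + ", ".join(genres) + "\n"
        "Symbolic signals:\n" + hint_block + "\n"
        "Act:\n" + text
    )


if __name__ == "__main__":
    act = "... FORMULA_SALE ..."  # toy input, not corpus text
    print(build_prompt(act, ["sale", "will"]))
```

Because the rules are deterministic, the same act always yields the same signals, so any variation in the final label comes from the neural side only.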

Details

Paper ID
lrec2026-ws-dialres-26
Pages
pp. 265–273
BibKey
chatzikyriakidis-etal-2026-heptatax
Editors
Antonis Anastasopoulos, Stella Markantonatou, Angela Ralli, Marcos Zampieri, Stavros Bompolas, Vivian Stamou
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the First Workshop on Dialects in NLP — A Resource Perspective
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Stergios Chatzikyriakidis

  • Eleni Karantzola

  • Vasiliki Makri
