LREC 2026 workshop

Evaluating Encoder- and LLM-Based Approaches for Robust Indirect Personal Identifier Detection

Proceedings of the Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Computational Approaches to Language Data Pseudonymization, Anonymization, De-identification, and Data Privacy (LEGAL2026 and CALD-pseudo 2026) @ LREC 2026

DOI:10.63317/3noxgknxv96m

Abstract

Removing explicit protected health information does not fully eliminate re-identification risk in clinical text. Contextual attributes such as socio-economic status, institutional affiliations, or detailed life circumstances may still enable linkage attacks. These heterogeneous and sparsely distributed elements, termed Indirect Personal Identifiers (IPIs), extend de-identification beyond fixed identifier lists and pose new modeling challenges. We present the first systematic comparison of encoder-only models, prompt-based LLMs, and hybrid pipelines for span-level IPI detection in English discharge summaries. A fine-tuned RoBERTa-large model improves on an existing baseline and substantially outperforms ChatGPT-5.2, achieving 0.906 micro-F1 and 0.724 macro-F1 versus 0.509 micro-F1 and 0.487 macro-F1 for ChatGPT-5.2. Our findings indicate that IPI detection constitutes a distinct modeling regime characterized by class imbalance and high intra-class variability, in which scaling model capacity alone does not guarantee macro-level robustness. We show that supervised encoder models currently provide the most reliable foundation for extending anonymization guarantees and for future research.
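
The gap between the micro-F1 and macro-F1 scores reported in the abstract is what one would expect under class imbalance: micro-F1 pools all spans, so frequent classes dominate, while macro-F1 averages per-class scores, so a poorly detected rare class counts as much as a common one. The following is a minimal sketch of the two standard definitions, not the authors' evaluation code. The class names and counts are hypothetical, and the span-matching criterion used to obtain true and false positives is an assumption the abstract does not specify.

```python
from typing import Dict, Tuple

# label -> (true positives, false positives, false negatives), counted over spans.
# How a predicted span is matched to a gold span is assumed to be decided upstream.
Counts = Dict[str, Tuple[int, int, int]]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts: Counts) -> float:
    """Pool counts over all classes, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Counts) -> float:
    """Compute F1 per class, then take the unweighted mean."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical toy data: one frequent, well-detected class and one rare,
# poorly detected class. These are NOT figures from the paper.
toy: Counts = {
    "FREQUENT_CLASS": (90, 5, 5),
    "RARE_CLASS": (1, 2, 3),
}

if __name__ == "__main__":
    print(f"micro-F1: {micro_f1(toy):.3f}")
    print(f"macro-F1: {macro_f1(toy):.3f}")
```

On the toy data, micro-F1 stays high because the frequent class supplies most of the counts, whereas macro-F1 drops because the rare class contributes half of the average.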

Details

Paper ID
lrec2026-ws-legal-11
Pages
pp. 91-101
BibKey
otto-etal-2026-evaluating
Editors
Ingo Siegert, Maria Irena Szawerna, Khalid Choukri, Simon Dobnik, Paweł Kamocki, Therese Lindström Tiedemann, Pierre Lison, Ricardo Muñoz Sánchez, Ildikó Pilán, Lisa Södergård, Kossay Talmoudi, Elena Volodina, Xuan-Son Vu
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Computational Approaches to Language Data Pseudonymization, Anonymization, De-identification, and Data Privacy (LEGAL2026 and CALD-pseudo 2026) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Christoph Otto
  • Ibrahim Baroud
  • Akiko Aizawa
  • Sebastian Möller
  • Roland Roller
  • Lisa Raithel
