
PersianAnonymizer: Evaluating LLM-Labeled Training for Efficient NER-based Anonymization in Persian

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/57u2ica9225o

Abstract

We target practical anonymization of Persian customer chats by training a compact NER model from LLM-labeled supervision and selecting the best labeler for deployment. We compare three instruction-tuned LLMs—DEEPSEEKV3-0324, GPT-OSS-120B, and QWEN3-235B-A22B-INSTRUCT-2507—as labelers producing span annotations under a shared JSON protocol, yielding four corpora (OSS_ZeroShot, Qwen_ZeroShot, Qwen_FewShot, DeepSeek_FewShot). A MATINAROBERTA-based token classifier is trained per corpus and evaluated with token-level Precision/Recall/F1 (overall and per-class). We also report Label Coverage Recall (LCR), the proportion of gold non-O tokens predicted as non-O, and quantify cross-labeler agreement via a token-level Venn analysis of the test annotations. Finally, we contrast the LLMs' test-set annotation latency on H200 nodes with the trained NER's test-time labeling on a single RTX 3090. Results show that supervision from OSS_ZeroShot yields the strongest macro-F1 and LCR, while the resulting NER labels an entire 40K-message test set in ∼2 minutes on one consumer GPU. This establishes a practical path to high-quality, low-cost anonymization for Persian industrial data.
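The abstract defines Label Coverage Recall (LCR) as the proportion of gold non-O tokens that the model predicts as any non-O label, regardless of entity class. A minimal sketch of that computation follows; the function name and the BIO tag strings are illustrative, not taken from the paper's code.

```python
def label_coverage_recall(gold_tags, pred_tags, outside="O"):
    """LCR: fraction of gold non-O tokens also tagged non-O by the
    model. The predicted entity class need not match the gold class;
    only O vs. non-O is compared, so LCR upper-bounds entity recall."""
    covered = total = 0
    for gold, pred in zip(gold_tags, pred_tags):
        if gold != outside:          # token is part of a gold entity
            total += 1
            if pred != outside:      # model flagged it as some entity
                covered += 1
    return covered / total if total else 0.0

# Hypothetical example: 3 gold entity tokens, 2 of which the model
# tags as entities (one with the wrong class, which still counts).
gold = ["O", "B-PER", "I-PER", "O", "B-LOC"]
pred = ["O", "B-PER", "O",     "O", "B-ORG"]
print(label_coverage_recall(gold, pred))  # → 0.666...
```

For an anonymization task this is a natural headline metric: a token masked with the wrong entity class is still anonymized, whereas a missed token leaks information.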

Details

Paper ID
lrec2026-main-352
Pages
pp. 4497-4506
BibKey
shalchian-etal-2026-persiananonymizer
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Mohammad Hossein Shalchian

  • Mostafa Amiri

  • Amir Mahdi Sadeghzadeh

Links