LREC 2026 (main)

The Potential for Misleading Results in Text Sanitisation with Standard Evaluation Metrics

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4ubbuzpc4hpu

Abstract

Data privacy is an important facet of modern life. It is especially important for data that carries potentially sensitive information, such as medical or legal documents. However, it is particularly difficult to ensure that private information has been removed or masked in unstructured data, e.g. free-flowing text. Evaluating systems that automatically detect and remove personally identifiable information (PII) from text is also challenging. Here we present a case study of a system that seemingly performed well; under closer scrutiny, however, its high performance was due to the shortcomings of standard binary classification metrics in the context of high target-class prevalence. We then give a short analysis of different possible metrics in these high-prevalence scenarios, clearly showing the superiority of the Matthews Correlation Coefficient. This is particularly important because readily available data in this domain is rare, and systems are often compared using biographies from Wikipedia, which have a naturally high prevalence. This can be further aggravated by certain reasonable pre-processing or evaluation formalisms, as in the case study discussed here.
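The effect described in the abstract can be illustrated with a minimal sketch. The confusion counts below are hypothetical and are not taken from the paper; they assume 1,000 tokens of which 90% are PII, and a system that masks almost everything. The formulas are the standard definitions of precision, recall, F1, accuracy and the Matthews Correlation Coefficient (MCC); the helper name `binary_metrics` is an assumption made for this illustration.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical data: 900 PII tokens and 100 non-PII tokens (90% prevalence).
# A system that masks nearly every token, regardless of content.
near_trivial = binary_metrics(tp=890, fp=95, fn=10, tn=5)
# A system that masks every token without exception.
mask_everything = binary_metrics(tp=900, fp=100, fn=0, tn=0)

for name, scores in [("near-trivial", near_trivial),
                     ("mask-everything", mask_everything)]:
    summary = ", ".join(f"{k}={v:.3f}" for k, v in scores.items())
    print(f"{name}: {summary}")
```

Under these assumed counts, precision, recall, F1 and accuracy all come out at or above 0.9 for both systems, while MCC stays close to zero, because MCC also accounts for how poorly the non-PII class is handled.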

Details

Paper ID
lrec2026-main-364
Pages
pp. 4638-4646
BibKey
zhang-etal-2026-potential
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Dan Zhang

  • Mark Anderson

Links