
MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents

Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference

DOI:10.63317/3o8okt9m7r2n

Abstract

This paper presents the outcomes of the MAPA project: a set of annotated corpora for 24 languages of the European Union and an open-source, customisable toolkit able to detect and substitute sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. Within the project, the toolkit was developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.

Details

Paper ID
lrec2022-ws-legal-12
Pages
pp. 64-72
BibKey
arranz-etal-2022-mapa
Publisher
European Language Resources Association (ELRA)
Workshop
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
Date
20-25 June 2022

Authors

  • Victoria Arranz
  • Khalid Choukri
  • Montse Cuadros
  • Aitor García Pablos
  • Lucie Gianola
  • Cyril Grouin
  • Manuel Herranz
  • Patrick Paroubek
  • Pierre Zweigenbaum
