From Oral History to Structured Data: The MalachNER Dataset

Proceedings of The Second Workshop on Holocaust Testimonies as Language Resources (HTRes)

Abstract

We present MalachNER, a new multilingual dataset for Named Entity Recognition (NER) in testimonies of Holocaust survivors. MalachNER has been sourced from different archives and annotated according to comprehensive domain-specific guidelines refined by an international collaboration of experts. Covering 10 European languages, it differs significantly from previously released datasets: it is based primarily on noisy, verbatim transcribed speech rather than on digitized written documents. Among other challenges, these transcripts contain fillers, dialectal speech, and in-line annotations marking incomprehensible words, which are not commonly encountered in other datasets. However, the large volume of as yet unprocessed oral history makes such a dataset a necessity. In addition to describing the dataset and its annotation guidelines, we show in baseline experiments that MalachNER is complementary to previously released data and is key to training domain-specific language models that generalize well to written and oral testimony alike, achieving state-of-the-art performance on both types of documents.