
From Oral History to Structured Data: The MalachNER Dataset

Proceedings of The Second Workshop on Holocaust Testimonies as Language Resources (HTRes)

DOI:10.63317/3zyb8y48bdnk

Abstract

We present MalachNER, a new multilingual dataset for Named Entity Recognition (NER) in testimonies of Holocaust survivors. MalachNER has been sourced from different archives and annotated according to comprehensive domain-specific guidelines refined through a collaboration of international experts. Covering 10 European languages, it differs significantly from previously released datasets: it is primarily based on noisy, verbatim transcribed speech rather than on digitized written documents. These transcripts are characterized by, among other challenges, fillers, dialectal speech, and in-line annotations indicating incomprehensible words, which are not commonly encountered in other datasets. However, the large volume of as yet unprocessed oral history makes such a dataset a necessity. In addition to describing the dataset and its annotation guidelines, we show with baseline experiments that MalachNER is complementary to previously released data and is key to training domain-specific language models that generalize well to written and oral testimony alike, achieving state-of-the-art performance on both types of documents.

Details

Paper ID
lrec2026-ws-htres-07
Pages
pp. 59-65
BibKey
brckner-etal-2026-oral
Editors
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of The Second Workshop on Holocaust Testimonies as Language Resources (HTRes)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Christopher Brückner

  • Karin Roginer Hofmeister

  • Jiří Kocián

  • Pavel Pecina

Links