LREC 2026 Workshop

Modeling the Language of Holocaust Survivors’ Testimony with Domain-Adapted Transformers

Proceedings of The Second Workshop on Holocaust Testimonies as Language Resources (HTRes)

DOI:10.63317/2cvjzqrxsjks

Abstract

Documents related to the Holocaust are increasingly becoming a focus of Natural Language Processing research, which spans the digitization of written text, the automatic transcription of oral archives, and interpretive downstream tasks such as Named Entity Recognition. However, most modern language models are trained primarily on contemporary text and thus struggle with historical language, historical entities, and domain-specific terminology. Furthermore, transcribed speech introduces challenges such as transcription errors, noise, filler words, and dialectal speech that are rarely found in textual datasets. We present XLM-RoBERTa-malach, a text encoder domain-adapted to oral testimonies of Holocaust survivors in seven languages. We describe the data acquisition via Automatic Speech Recognition, the data augmentation via Machine Translation, and the continued pretraining of a state-of-the-art multilingual transformer, and we evaluate the domain-adapted model on Named Entity Recognition. In these experiments, the domain-adapted model outperforms the general-domain transformer in a multilingual domain-specific setting, including on languages not seen during domain adaptation.

Details

Paper ID
lrec2026-ws-htres-09
Pages
pp. 74-83
BibKey
brckner-etal-2026-modeling
Editors
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of The Second Workshop on Holocaust Testimonies as Language Resources (HTRes)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Christopher Brückner

  • Jan Lehečka

  • Jan Švec

  • Pavel Pecina
