Modeling the Language of Holocaust Survivors’ Testimony with Domain-Adapted Transformers

Proceedings of The Second Workshop on Holocaust Testimonies as Language Resources (HTRes)

Abstract

Documents related to the Holocaust are increasingly becoming a focus of Natural Language Processing research, which covers the digitization of written text, the automatic transcription of oral archives, and interpretive downstream tasks such as Named Entity Recognition. However, most current language models are trained primarily on contemporary text and thus struggle with historical language, historical entities, and domain-specific terminology. Furthermore, transcribed speech introduces challenges such as transcription errors, noise, filler words, and dialectal speech that are rarely found in textual datasets. We present XLM-RoBERTa-malach, a text encoder domain-adapted to oral testimonies of Holocaust survivors in seven languages. We describe the data acquisition via Automatic Speech Recognition, the data augmentation via Machine Translation, and the continued pretraining of a state-of-the-art multilingual transformer, and we evaluate the domain-adapted model on the Named Entity Recognition task. Experiments on this task show that it outperforms the general-domain transformer in a multilingual domain-specific setting, including on languages not seen during the domain adaptation.