WhiteHouse: Translation of the Casablanca Corpus for Multi-dialectal Arabic Speech Translation

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

Remarkable progress has recently been made in speech processing for Arabic dialects. This is primarily due to the availability of large multilingual pre-trained models and the development of multiple well-annotated datasets that support training, fine-tuning, and evaluation of various speech models. However, most existing research on Arabic speech processing has focused on Dialect Identification (DI) and Automatic Speech Recognition (ASR) and has not considered Automatic Speech Translation (AST). To address this gap, we introduce WhiteHouse, the first multi-dialectal Arabic-English speech translation corpus. WhiteHouse supplements the recently created Casablanca dataset with an English translation for each utterance in the transcripts. The result is a three-way parallel multi-dialectal Arabic dataset of speech, transcription, and translation. We use WhiteHouse to evaluate various state-of-the-art (SoTA) speech translation models. Our experiments show that SoTA speech translation models perform poorly when evaluated on dialectal Arabic. All the data used during training and testing are released for public use and further improvement.