The NakbaEcho Dataset: From Oral Testimonies to a Transcribed Arabic History Corpus

Proceedings of the 2nd International Workshop on Nakba Narratives as Language Resources @ LREC 2026

Abstract

We present NakbaEcho, a dataset derived from Palestinian testimonies about the 1948 Nakba. The resource is built by transcribing over 2,180 hours of recorded interviews gathered through the Palestine Remembered Oral History index and linked to multiple repositories, including the Palestinian Oral History Archive (POHA) and YouTube-hosted interviews. We harmonize interview-level metadata and generate timestamp-aligned transcripts from the original Arabic recordings using an automatic transcription pipeline configured for Palestinian Arabic. The dataset includes speaker-labeled segments and auxiliary annotations designed to support downstream research in Arabic speech processing, natural language processing, digital humanities, and oral-history analysis. NakbaEcho contributes a structured computational resource for studying Palestinian oral testimony while expanding the availability of dialectal Arabic materials for speech, text, and social research.