Voices across Decades: A Multimodal Diachronic Corpus of German Bundestag Debates (GerParlDia-MM)
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This paper presents a multimodal diachronic corpus of German parliamentary debates spanning 1949–2025. The dataset focuses on speakers with exceptionally long political careers in the Bundestag: women who served at least six parliamentary terms and men who served at least eight. It comprises 75 individuals (43 men, 32 women) and 2,136 speeches. The corpus integrates audio, video (where available), and official transcripts, enriched with metadata on date, party affiliation, and legislative term. Transcripts were temporally aligned with parliamentary media recordings, and non-speech segments were automatically removed. The corpus enables research on voice aging, intra-speaker variability, and longitudinal political language, and supports benchmarking of ASR and speaker recognition across decades. It thus bridges the gap between short-term speech corpora and single-speaker longitudinal datasets, offering a unique foundation for studying change in voice, style, and rhetoric over more than seventy years of German parliamentary history.