FPSC: A Sustainable Pipeline for Building a Faroese Parliamentary Speech Corpus
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This work addresses the lack of large-scale natural speech data for Faroese automatic speech recognition (ASR). Existing resources, such as the 100-hour Ravnursson corpus, consist of read speech and do not capture the spontaneous variation, sociolinguistic characteristics, and prosody of real dialogue, which limits model performance. To overcome this, we present the Faroese Parliament Speech Corpus (FPSC), a 1,600-hour collection of parliamentary recordings comprising 89,000 speeches with detailed speaker and linguistic metadata. The corpus includes weakly supervised transcriptions generated by an ensemble of four Faroese-adapted ASR models whose outputs are combined through a ROVER-based voting procedure. In creating FPSC, we trained several new state-of-the-art ASR models for Faroese, some built on large-scale pretrained backbones and others leveraging multilingual transfer, all of which outperform previously published Faroese ASR systems. FPSC is the first corpus of natural spoken Faroese and a major step toward realistic ASR modeling for the language, offering an open, reproducible, and scalable resource for future speech and language research.
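To illustrate the idea behind ROVER-style combination, the following is a minimal sketch of word-level voting over several ASR hypotheses. It is not the pipeline used to build FPSC: the function names `align_to_pivot` and `vote` are hypothetical, the alignment uses Python's `difflib` rather than ROVER's iterative dynamic-programming word transition network, the first hypothesis is assumed to serve as the alignment pivot, words that other systems insert relative to the pivot are ignored, and no confidence scores are used.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot, hyp):
    """Return one entry per pivot word: the aligned word from hyp, or "" if none.

    Simplification (assumption of this sketch): words that hyp inserts
    relative to the pivot are dropped.
    """
    slots = [""] * len(pivot)
    matcher = SequenceMatcher(a=pivot, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair words positionally; surplus words on either side are skipped.
            for k in range(min(i2 - i1, j2 - j1)):
                slots[i1 + k] = hyp[j1 + k]
        # "delete": the pivot word has no counterpart, so the slot stays "".
        # "insert": hyp words without a pivot slot are ignored here.
    return slots


def vote(hypotheses):
    """Majority vote per pivot position; ties are resolved in favour of the pivot."""
    pivot = hypotheses[0].split()
    columns = [align_to_pivot(pivot, h.split()) for h in hypotheses]
    output = []
    for i, pivot_word in enumerate(pivot):
        counts = Counter(column[i] for column in columns)
        best, best_count = counts.most_common(1)[0]
        if counts[pivot_word] == best_count:
            best = pivot_word
        if best:  # an empty winner means the systems voted to delete the word
            output.append(best)
    return " ".join(output)


if __name__ == "__main__":
    # Toy English hypotheses standing in for the outputs of four ASR models.
    print(vote([
        "the bat sat on the mat",
        "the cat sat on the mat",
        "the cat sat on the mat",
        "the cat sat on a mat",
    ]))
```

In this toy case the pivot's error ("bat") and the fourth system's error ("a") are each outvoted by the other three hypotheses. A full ROVER implementation additionally handles insertions and can weight votes by word confidence.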