LocalGovPL: A Corpus of Speaker-Attributed Polish Local Government Transcripts
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present LocalGovPL, a large-scale, speaker-annotated corpus of Polish local government meeting transcripts processed with an automatic two-stage LLM pipeline. The corpus comprises 31,900 sessions from 749 councils recorded between 2018 and 2025 (approximately 391M words). It is released in TEI P5 format with explicit links between utterances and registered participants. We collect transcripts from official local government portals using a dedicated crawler, normalize the text, and apply (1) LLM-assisted extraction of person names and administrative roles and (2) attribution of utterances to the identified speakers using discourse cues. To assess attribution quality, we manually annotate 30 sessions and compare five LLM configurations under three evaluation protocols using speaker-aware word error rate (sWER). The strongest system, Gemini-2.5-pro, achieves 3.9% sWER for abstract speaker identification, 4.6% for known participants, and 5.9% for end-to-end processing with relaxed name matching. LocalGovPL enables large-scale analysis of local deliberative discourse and supports research on dialogue modeling, summarization, and political text analysis.
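To make the "explicit links between utterances and registered participants" concrete, the following minimal sketch shows one way such a link could be expressed in TEI P5, using the standard `<person xml:id="...">` and `<u who="#...">` mechanism. It is an illustration only, not the corpus schema: the function name `build_session`, the choice of `<occupation>` to hold the administrative role, the participant identifier, and the sample name and utterance are all assumptions, and the output omits elements a schema-valid TEI header requires (such as `fileDesc`).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return a tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_session(participants, utterances):
    """Build a simplified TEI tree (hypothetical layout, not the corpus schema).

    participants: dict mapping a participant id to a (name, role) tuple
    utterances:   list of (participant id, utterance text) tuples
    """
    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    profile = ET.SubElement(header, tei("profileDesc"))
    partic = ET.SubElement(profile, tei("particDesc"))
    list_person = ET.SubElement(partic, tei("listPerson"))

    # Register every participant once, with a stable xml:id.
    for pid, (name, role) in participants.items():
        person = ET.SubElement(list_person, tei("person"), {f"{{{XML_NS}}}id": pid})
        ET.SubElement(person, tei("persName")).text = name
        ET.SubElement(person, tei("occupation")).text = role

    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))

    # Each utterance points back to a registered participant via @who.
    for pid, content in utterances:
        if pid not in participants:
            raise ValueError(f"utterance refers to unregistered participant: {pid}")
        ET.SubElement(body, tei("u"), {"who": f"#{pid}"}).text = content

    return root


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    session = build_session(
        {"p1": ("Jan Kowalski", "przewodniczący rady")},
        [("p1", "Otwieram sesję rady.")],
    )
    print(ET.tostring(session, encoding="unicode"))
```

Keeping the participant list in the header and referencing it by identifier means a speaker's name and role are stored once per session, and every utterance can be resolved to a participant with a single lookup.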