Construction of Japanese Prefectural Assembly Minutes Datasets across Three Electoral Terms: Comparative Analysis of 2011, 2015, and 2019 Four-Year Periods

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

We present a longitudinal, cross-regional corpus of Japanese prefectural assembly minutes spanning 12 years (2011-2023) and three electoral terms. The corpus comprises 12,236,974 records containing 743,147,226 characters (471,496,688 tokens) of transcribed remarks from the plenary sessions of all 47 prefectural assemblies in Japan. Each dataset is organized by speaker, with assembly members linked to their electoral information, including gender, age, and electoral district. A comparative analysis across the three terms reveals significant temporal changes: the proportion of members aged 25-44 decreased, whereas female representation increased. Female members use 20-30% more characters per speech than their male counterparts across all age groups. The proportion of members who never speak ranges from under 2% for younger female members to over 10% for male members aged 65 and over. We demonstrate the utility of the corpus through three applications: a quantitative analysis of gender and age patterns in political discourse, AI-driven computational dialectology for extracting regional linguistic features, and a web-based search and visualization system. The corpus provides a valuable resource for interdisciplinary research on subnational politics, computational linguistics, dialectology, and political communication in non-Western democracies. The datasets are available for research purposes upon request, and public query access is provided through a web-based interface.