Ecological Discourse Modeling in a Low-Resource Setting: A Longitudinal Vietnamese Climate Corpus with Comparative Topic Modeling
Proceedings of the 2nd Workshop on Ecology, Environment, and Natural Language Processing
Abstract
Climate change discourse has expanded substantially in recent decades, yet computational analyses remain concentrated on high-resource languages. In this paper, we construct a longitudinal Vietnamese climate news corpus and examine its thematic structure and temporal evolution in a low-resource setting. The corpus comprises 10,401 articles published between 2004 and 2026, preprocessed with linguistically informed word segmentation. To ensure domestic relevance, we apply transformer-based Named Entity Recognition (NER) and construct a geographically grounded subset of 4,501 Vietnam-focused documents. We analyze this dataset using both Latent Dirichlet Allocation (LDA) and BERTopic. Results reveal stable thematic dimensions alongside longitudinal shifts from event-driven pollution reporting toward governance- and energy-centered narratives. The embedding-based model (BERTopic) achieves higher semantic coherence than LDA while maintaining comparable topic diversity. The main contributions of this work are thus a structured Vietnamese climate corpus and a systematic analysis of discourse evolution in an underrepresented language context.
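The abstract compares the two models on topic diversity without defining the metric. As a minimal sketch, assuming the common definition (the fraction of unique words among the top-k words of all topics), the computation could look as follows. The function name `topic_diversity`, the parameter `top_k`, and the example word lists are illustrative assumptions and are not taken from the corpus or the reported experiments.

```python
def topic_diversity(topics, top_k=10):
    """Return the fraction of unique words among the top_k words of all topics.

    `topics` is a list of word lists, each ordered by descending relevance.
    A value near 1.0 means topics share few top words; a value near 0.0
    means the topics are largely redundant.
    """
    # Flatten the top_k words of every topic into a single list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical top words for three topics (illustrative only).
example_topics = [
    ["pollution", "air", "dust", "hanoi"],
    ["energy", "solar", "wind", "power"],
    ["policy", "emission", "energy", "government"],
]

diversity = topic_diversity(example_topics, top_k=4)
print(f"Topic diversity: {diversity:.3f}")
```

Because the score depends only on each model's top-word lists, the same function can be applied to LDA and BERTopic output alike, provided both are truncated to the same `top_k`.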