A Diachronic Comparable Corpus of Spanish Digital News (2017–2026) for the Study of Stylistic Convergence in the GenAI Era
Proceedings of the 19th Workshop on Building and Using Comparable Corpora (BUCC)
Abstract
This study introduces a comparable corpus of Spanish digital news (2017–2026) designed to analyze potential linguistic shifts coinciding with the widespread adoption of Generative AI. We propose an analytical framework structured across three levels: lexical statistics, semantic topology, and neural classification. By applying an NER-masking protocol, we isolate structural discourse markers from topical content to identify the stylistic patterns of the contemporary period. Our results suggest a measurable structural shift within the analyzed corpus, indicating a trend toward a more standardized professional register. While macro-statistical metrics such as Shannon entropy remain stable, indicating statistical consistency, Zipf-Mandelbrot distributions and SVD mapping reveal a concentration of unique vocabulary into more predictable clusters. Specifically, the 2023–2026 subcorpus exhibits a discernible topological displacement relative to the 2017–2021 baseline. The study also identifies a ‘Gray Zone’ in which highly structured technical reporting and hybridized production become indistinguishable, suggesting a structural stylistic convergence within this digital environment. These findings provide a methodological baseline for analyzing discursive stabilization in professional domains without assuming definitive authorship.
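As an illustration of the lexical level only, the following minimal sketch shows how entity masking followed by a Shannon entropy computation over the masked token stream could look. It is not the authors' pipeline: the toy entity lexicon, the placeholder format (`[PER]`, `[LOC]`, `[ORG]`), the regular-expression tokenizer, and all function names are assumptions made for this example, and a real NER model would replace the lexicon lookup.

```python
import math
import re
from collections import Counter

# Hypothetical entity lexicon standing in for the output of a real NER model.
TOY_ENTITIES = {"Ana García": "PER", "Madrid": "LOC", "Acme": "ORG"}


def mask_entities(text, entities=TOY_ENTITIES):
    """Replace each known entity surface form with a placeholder such as [LOC]."""
    # Longest surface forms first, so multi-word names are masked as a unit.
    for surface in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(surface), f"[{entities[surface]}]", text)
    return text


def tokenize(text):
    """Split into tokens, keeping placeholders intact and lowercasing words."""
    raw = re.findall(r"\[[A-Z]+\]|\w+", text)
    return [t if t.startswith("[") else t.lower() for t in raw]


def shannon_entropy(tokens):
    """Entropy in bits of the empirical unigram distribution of the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sentence = "Ana García visitó Madrid y Madrid respondió"
    masked = mask_entities(sentence)
    print(masked)
    print(shannon_entropy(tokenize(masked)))
```

Masking before counting means that the entropy reflects function words and discourse markers rather than the names that dominate a given news cycle, which is the separation of structure from topic that the protocol aims for.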