A Corpus-Based Profiling of Regional English Variants in Global Media: Insights from Olympic Journalism
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This paper investigates the distinctive linguistic characteristics of regional English variants through a quantitative analysis of global media coverage. The study applies a classification approach that combines GPT-based embeddings with Support Vector Machines to a novel corpus, the Olympic Journalism English Variants Corpus. The corpus comprises news articles on the Olympic Games published by prominent news outlets in the United States, China, Spain, and Mexico between 2020 and 2023, and it enables a fine-grained analysis of 164 linguistic features across lexical, syntactic, readability, and sentiment dimensions. The findings reveal strong and interpretable distinctions in features such as verb ratio, nominality, and readability. The study not only demonstrates the enhanced classification capabilities of the model (optimized F1 score = 97.2), but also yields a deeper, data-driven stylistic analysis of each English variant. This work provides a potential template that can be extended to other World Englishes research.