Introducing a Green Leaderboard for Sustainable Risk Prediction in Streaming NLP Shared Tasks

Proceedings of the 2nd Workshop on Ecology, Environment, and Natural Language Processing

Abstract

Current NLP shared-task evaluations predominantly rank systems by predictive performance, overlooking computational efficiency and environmental impact. This limitation is particularly critical in streaming and early risk detection scenarios, where models operate continuously and resource consumption accumulates over time. We propose a sustainability-aware evaluation framework for streaming NLP tasks by introducing the Green Early Detection Score (GED), which integrates classification performance, detection timeliness, and carbon emissions. We also present an energy-based variant tailored to on-device early risk detection settings, where energy consumption per inference is a key constraint. Applying these metrics to three editions (2023–2025) of the MentalRiskES shared task, we construct the first Green Leaderboard for early risk detection. Our results show that sustainability-aware ranking substantially reshapes system positions, highlighting efficient models that remain undervalued under performance-only evaluation.