Identifying Contexts of Distress in College Students' Reddit Posts: A Comparative Study of Classical NLP and Large Language Models
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Mental health is a salient and growing societal concern among college students. Social media platforms such as Reddit offer a rich source of data on how students talk about their mental health, and NLP tools may assist in identifying when a student is struggling. In this paper, we investigate how different NLP tools can be used to extract the context surrounding college students' expressions of distress. We construct a novel dataset from Reddit posts (College Distress on Reddit, or CDR) and examine both the "classical NLP pipeline" and modern generative LLMs on this data. We explore our dataset in parallel with, and in contrast to, the Dreaddit dataset to examine cross-domain variation. Results show that standard or "classical" NLP tools extract a limited number of concrete entities, whereas generative models can infer more nuanced causes. However, LLMs struggle with knowledge extraction in specific content areas. Our findings underscore the need for caution when applying LLMs, especially in mental health contexts.