Detecting Risky Behavior Related to Alcohol and Drug Use within Adolescents' Private Messenger Conversations
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Alcohol and drug use harm adolescents’ health, making early detection and prevention essential. One promising approach is to analyze adolescents’ online conversations for signs of substance use. However, current machine learning models for online detection often rely on public data sources that fail to capture adolescents’ private experiences. In this study, we developed a BERT-based model that automatically identifies discussions about alcohol and drug use in adolescents’ private messenger conversations. Our novel dataset comprises 272,465 annotated utterances drawn from a corpus of 1,260,492 utterances in 2,807 chats authored by 2,165 individuals, primarily in Czech. Our best BERT-based model achieved an F₁ score of 0.817, demonstrating that this social science task is feasible even in a low-resource language such as Czech. Additionally, we verified that state-of-the-art open-source generative large language models are equally effective for this task and can be successfully adapted to other languages, including English. We also analyzed misclassified utterances to identify problematic patterns and improve model performance. The resulting models have practical implications for parental mediation software and parental control applications: by automating substance use detection and enabling appropriate real-time interventions, these tools can help safeguard adolescents’ health.