Can Large Language Models Facilitate Qualitative Political Narrative Analysis?

Proceedings of the Second Workshop on Building Educational Applications Using NLP

Abstract

This study evaluates whether Large Language Models (LLMs) can facilitate qualitative political narrative analysis by comparing outputs from four models (Mistral, Llama, ChatGPT-4o, and DeepSeek) against narrative analyses written by expert scholars. Using European Union State of the Union speeches (2010–2023), we examine migration and solidarity narratives through semantic and lexical similarity metrics alongside systematic validation. The narrative scholars show strong semantic alignment with one another despite differences in wording, which establishes a benchmark for interpretive consistency. Across both topics, the models produce lexical and semantic similarity scores broadly comparable to those observed between the scholars themselves, and the differences are often marginal. Similarity metrics, however, do not give the full picture. Validation reveals model-specific weaknesses that lexical and semantic alignment do not capture, including factual errors, overly structural abstraction, and difficulty engaging with less salient narrative threads. LLMs can therefore produce narratives that closely match human outputs on similarity measures, yet these measures alone are insufficient to assess interpretive quality.
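To make the distinction between lexical and semantic similarity concrete, the following is a minimal sketch and not the study's actual pipeline. The abstract does not name the specific metrics used, so Jaccard token overlap (lexical) and cosine similarity over embedding vectors (semantic) are assumptions chosen for illustration. The function names are hypothetical, and the embedding vectors are assumed to come from a sentence-embedding model of the reader's choice, which is not shown.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard_similarity(a: str, b: str) -> float:
    """Lexical overlap: shared vocabulary divided by combined vocabulary.

    Illustrative stand-in for a lexical metric; the source does not
    specify which lexical measure was used.
    """
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    In a semantic comparison, u and v would be embeddings of two
    narrative analyses produced by an embedding model (assumed, not shown).
    """
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Two analyses can score low on `jaccard_similarity` while their embeddings score high on `cosine_similarity`, which is the pattern the abstract describes for the scholars: strong semantic alignment despite differences in wording. Neither score checks factual accuracy, which is why a separate validation step is still needed.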