Do We Still Need Corpora and Corpus Analysis Platforms? Discourse Analysis in Times of LLMs
Proceedings of Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities (LLMs4SSH) @ LREC 2026
Abstract
Corpus-based discourse analysis investigates the linguistic construction of societally shared knowledge by iterating between quantitative pattern detection and qualitative interpretation in large text collections. Large Language Models (LLMs) promise to lower practical barriers to such work (e.g., through natural-language querying and qualitative coding), yet they also introduce risks that are especially consequential in discourse-analytic settings, where fluent summaries can encourage ungrounded interpretation. This position paper argues that integrating LLMs into corpus analysis platforms is appropriate only insofar as it remains compatible with three epistemic premises of corpus research: (1) transparency of the data basis and traceability of analytical operations; (2) interpretability as evidence-constrained sense-making; and (3) seriality and patternedness as distributional structure and variation. We contribute a platform-oriented requirements perspective that translates these premises into design constraints for tool-calling and RAG-style integration, and we outline implementation directions that treat LLMs as an interaction layer over inspectable corpus retrieval and platform-based analysis.