Reproducibility under Threat: Proposing a Framework for Reliable LLM-Research in Psychology and Computational Social Science

Proceedings of the Second Workshop on Building Educational Applications Using NLP

Abstract

The integration of artificial intelligence (AI), particularly large language models (LLMs), into social science research has accelerated innovation but has also introduced significant challenges to reproducibility, a cornerstone of scientific integrity. In this review of scientific practices, we examine the reproducibility crisis in AI-driven research with a focus on psychology, identifying common pitfalls, reviewing proposed solutions, and advocating best practices. We illustrate the pitfalls common in current social science practice through synthesized research scenarios: (1) using inaccessible datasets or language models with restricted access, (2) treating black-box API outputs as stable observations while ignoring updates and hidden changes, (3) reporting a single run as a measurement instead of aggregating performance over repeated stochastic draws, (4) failing to report the full LLM version, prompts, and sampling parameters, and (5) training and fine-tuning LLMs opaquely. Our recommended practices include precisely documenting the model used, fixing all inference parameters, using automation and scripts to control prompts, context, and outputs, and standardizing the environment and API conditions. By embracing transparency and methodological rigour, we can transform the challenges of AI-driven research into opportunities for more robust and impactful science, so that innovation does not come at the cost of credibility.
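To make the recommended practices concrete, the following minimal Python sketch shows one way a study script might record the exact model identifier, fix the inference parameters, fingerprint the prompt, and aggregate over repeated stochastic draws rather than a single run. It is an illustration under stated assumptions, not a procedure prescribed in this paper: `RunConfig`, `prompt_fingerprint`, `query_model`, and `run_experiment` are hypothetical names, the model identifier is invented, and `query_model` is a seeded stand-in that simulates sampling noise in place of a real LLM client.

```python
import hashlib
import json
import random
import statistics
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Everything needed to repeat a query; all values used here are illustrative."""
    model_id: str       # full, dated model identifier (hypothetical name below)
    temperature: float
    top_p: float
    max_tokens: int
    seed: int
    n_draws: int        # number of repeated stochastic draws to aggregate


def prompt_fingerprint(prompt: str) -> str:
    """Hash the exact prompt text so that later edits to it are detectable."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def query_model(prompt: str, config: RunConfig, draw: int) -> float:
    """Hypothetical stand-in for an LLM call that returns a numeric rating.

    A real study would call an actual client here; this stub uses a seeded
    random generator so the sketch runs offline and repeats exactly.
    """
    rng = random.Random(f"{config.seed}-{draw}-{prompt_fingerprint(prompt)}")
    return round(rng.uniform(1.0, 7.0), 2)


def run_experiment(prompt: str, config: RunConfig) -> dict:
    """Collect repeated draws and return them alongside a full run manifest."""
    scores = [query_model(prompt, config, d) for d in range(config.n_draws)]
    return {
        "config": asdict(config),
        "prompt_sha256": prompt_fingerprint(prompt),
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }


if __name__ == "__main__":
    cfg = RunConfig("example-model-2024-01-01", 0.7, 1.0, 64, seed=1234, n_draws=20)
    prompt = "Rate your agreement from 1 to 7: I enjoy meeting new people."
    print(json.dumps(run_experiment(prompt, cfg), indent=2))
```

Storing the returned manifest next to the raw outputs lets a reader see which model version, parameters, and prompt produced each reported aggregate. With a hosted API, a fixed seed may not guarantee identical outputs, which is one reason the sketch reports a mean and standard deviation over several draws rather than a single value.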