This House Debates AI: Evaluating a Language Model in Oxford-Style Debates against Human Experts
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Recent work shows that large language models (LLMs) are increasingly capable of generating persuasive arguments and messages, raising concerns about undue influence on human beliefs. Most evidence so far, however, evaluates LLM argumentation and persuasion in single-turn interactions and/or against weak human baselines. To address this gap, we benchmark a state-of-the-art LLM, Llama 3.1 Instruct 405B, in 100 six-turn Oxford-style debates against 20 experienced human debaters. Each anonymised debate is rated by 5 independent raters, who provide win/loss judgments as well as 0–100 scores across 11 dimensions of quality. Based on these ratings, the LLM is competitive overall, with a win rate of 51.2%, ranking 6th out of 21 debaters on mean performance score. Compared to humans, the LLM generally scores higher on presentational dimensions (e.g., clarity, confidence, formality) and on par on most substantive dimensions (convincingness, evidence, originality). We also find that raters' stance tends to shift, from before to after a debate, towards the position they judged to be the winning side, regardless of whether that side was argued by the LLM or a human. Overall, our results provide new evidence on the qualities of LLM argumentation and its drivers, suggesting strong argumentative competence even in competitive multi-turn settings.