Evaluating Professional Acceptability of LLM-Generated Systematic Review Summaries in Healthcare: Psychiatrists’ Perspectives
Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026
Abstract
Cochrane systematic reviews evaluate the effectiveness and safety of medical interventions. Patients can benefit when clinicians integrate the findings of these reviews into their daily practice. However, systematic reviews are usually long documents; even their abstracts can extend to 1000 words, making rapid appraisal challenging for busy health professionals. Large language models (LLMs) offer the potential to distil these abstracts further. Nevertheless, generating high-quality, clinician-oriented summaries in this context is non-trivial. Such summaries must comprehensively cover the original abstract while remaining accurate and professionally acceptable, i.e., retaining all clinically important details. To address this challenge, we have developed a novel dataset, PsycSumEval, comprising summaries generated by four different LLMs for 115 Cochrane abstracts concerning mental health. Psychiatrists evaluated each summary across nine content dimensions, assigning scores and providing free-text justifications that highlight inaccuracies and missing details. The corpus provides fine-grained insight into how psychiatrists assess the professional acceptability of compressed medical evidence. Rather than treating agreement as a merely statistical endpoint, we capture structured expert judgments alongside their rationales, enabling transparent analysis of where professional norms are stable and where interpretive latitude persists. We contribute both a rigorous evaluation dataset and an explicit model of expert acceptability criteria for medical evidence summarisation.