Useful to Whom? A Persona-Driven Evaluation of Knowledge-Adapted Health Question Reformulation via LLM Simulation
Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026
Abstract
Automatic metrics such as F1 and BERTScore are often insufficient for evaluating user-centric generative tasks like Consumer Health Question (CHQ) reformulation. A high F1 score may not correlate with user satisfaction, especially when the user’s knowledge level (UKL) dictates their needs. We propose a Persona-Driven Evaluation Framework (PDEF), grounded in the cognitive science and health literacy literature, to measure persona-specific utility. The framework assesses reformulations from the perspectives of a ‘Layperson’ (requiring foundational context) and an ‘Expert’ (requiring efficient, precise answers). We apply it to a set of LLM-generated reformulated questions and test the robustness of the evaluation by using three state-of-the-art LLMs (GPT-4o, Llama 3.3, and Mistral Large) as evaluators. Our results reveal a significant disconnect between automatic metrics and user-perceived quality: the model with the highest F1 score (0.6134) was consistently outperformed in user preference by a Pipelined model, with experts preferring the latter by a statistically significant margin (p < 0.001). Furthermore, our persona-driven ablation analysis provides robust evidence that two architectural components, UKL inference and Entailment logic, are linked to significant gains in persona-driven utility for Layperson cohorts. This work demonstrates the critical need for user-centric evaluation and shows that its findings generalize across different LLM architectures.