Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Can large language models reliably express a human-like personality, or do they merely mimic surface cues without a stable underlying profile? We study this question on the long-form Essays Dataset, chosen over short, mood-driven text because it better reflects stable traits. Using the IPIP-NEO self-report questionnaire, we ask: (i) does post-training (SFT, DPO, ORPO) stabilize questionnaire scores under prompt rephrasings, and (ii) can it induce target Big Five profiles from unguided essays? Across five models, fine-tuning consistently reduces variance in questionnaire responses, mitigating the fragility seen in pre-trained models. Yet accuracy on the full five-dimensional profile remains near chance even when single-trait scores improve, indicating that unguided essays lack the cues needed for faithful personality expression. We argue for scenario-grounded datasets or for interactive elicitation that accumulates test-aligned evidence over time.
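To make the two evaluation quantities concrete, the following is a minimal illustrative sketch (not the paper's code; all function names, the tolerance parameter, and the toy data are assumptions). It computes per-trait score variance across prompt rephrasings and exact-match accuracy over the full five-trait profile, illustrating why joint accuracy can sit near chance even when individual traits look fine.

```python
# Illustrative sketch only; names, shapes, and the tolerance are hypothetical.
import numpy as np

TRAITS = ["O", "C", "E", "A", "N"]  # Big Five dimensions

def rephrasing_variance(scores: np.ndarray) -> np.ndarray:
    """Per-trait variance across prompt rephrasings.

    scores: shape (n_rephrasings, 5), one questionnaire-derived trait
    score per rephrasing. Lower variance = more stable responses.
    """
    return scores.var(axis=0)

def full_profile_accuracy(pred: np.ndarray, target: np.ndarray,
                          tol: float = 0.5) -> float:
    """Fraction of subjects whose predicted profile matches the target
    on ALL five traits simultaneously (within a tolerance).

    pred, target: shape (n_subjects, 5).
    """
    hits = np.all(np.abs(pred - target) <= tol, axis=1)
    return hits.mean()

# Toy data with made-up numbers, for shape only.
rng = np.random.default_rng(0)
scores = rng.normal(3.0, 0.4, size=(10, 5))  # 10 rephrasings x 5 traits
print(dict(zip(TRAITS, rephrasing_variance(scores).round(3))))

pred = rng.integers(1, 6, size=(100, 5)).astype(float)
target = rng.integers(1, 6, size=(100, 5)).astype(float)
print(f"full-profile accuracy: {full_profile_accuracy(pred, target):.3f}")
# With five roughly independent traits, joint accuracy is close to the
# product of the per-trait accuracies, so it can stay near chance even
# when each single-trait score improves.
```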