https://arxiv.org/abs/2311.10054
Key findings:
- Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions
- Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added
- Automatically identifying the best persona is challenging, with predictions often performing no better than random selection
- While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random
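For anyone who wants to poke at this themselves, here is a minimal sketch of the persona-vs-control comparison, assuming a generic `ask(system_prompt, question)` callable for whatever model client you use and a list of `{"question": ..., "gold": ...}` dicts; the function names, persona strings, and exact-match scoring are illustrative, not the paper's actual harness.

```python
from statistics import mean
from typing import Callable

# `ask(system_prompt, question) -> answer` is a placeholder for your model
# client; nothing here is tied to a specific API.
AskFn = Callable[[str, str], str]

def accuracy(ask: AskFn, system_prompt: str, questions: list[dict]) -> float:
    """Fraction of questions answered exactly correctly under one system prompt."""
    return mean(
        ask(system_prompt, q["question"]).strip().lower() == q["gold"].strip().lower()
        for q in questions
    )

def persona_vs_control(ask: AskFn, questions: list[dict], personas: list[str]) -> dict[str, float]:
    """Accuracy delta of each persona prompt relative to the no-persona control."""
    baseline = accuracy(ask, "", questions)  # control: no persona in the system prompt
    return {p: accuracy(ask, p, questions) - baseline for p in personas}

# Illustrative personas only; the paper tests 162 of them across
# relationship and expertise categories.
example_personas = [
    "You are talking to your friend.",
    "You are a lawyer.",
    "You are a physicist.",
]
```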
Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.
A persona is not the same thing as a role. The point of a role is to limit the scope of the agent's work and to focus it on one or two behaviors.
What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.
The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the LLM's output.
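To make that distinction concrete, here is roughly what the two kinds of system prompt look like; the wording below is made up for illustration, not taken from the paper.

```python
# A persona in the paper's sense: an identity attached to the model that does
# not constrain what work it should do.
persona_prompt = "You are a helpful assistant. You are talking to your mother."

# A role in the commenter's sense: it scopes the agent to one or two behaviors
# and says what to produce and what to ignore.
role_prompt = (
    "You are a security engineer reviewing a pull request. "
    "Only report concrete vulnerabilities and their fixes; "
    "do not comment on style, naming, or performance."
)
```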
I would be interested in an eval that checked both conditions: "you are an amazing X" vs. "you are a terrible X". There have also been a bunch of papers recently looking at whether threatening the LLM improves output; I would like to see a variation that tries carrot and stick as well.
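A sketch of what that condition grid could look like, crossing persona valence with incentive framing; the helper name and the wording of each condition are invented for illustration.

```python
from itertools import product

valence = {
    "amazing": "You are an amazing {role}.",
    "terrible": "You are a terrible {role}.",
    "neutral": "You are a {role}.",
}
incentive = {
    "none": "",
    "carrot": " You will be rewarded for each correct answer.",
    "stick": " You will be penalized for each wrong answer.",
}

def build_conditions(role: str) -> dict[tuple[str, str], str]:
    """Every (valence, incentive) pair as a candidate system prompt."""
    return {
        (v, i): valence[v].format(role=role) + incentive[i]
        for v, i in product(valence, incentive)
    }

# e.g. build_conditions("security engineer") yields 9 system prompts to score
# against the same frozen question set and compare to a no-persona control.
```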
Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
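A minimal sketch of what "freezing" the harness can mean in practice, assuming a local task file you can hash; the class and field names are illustrative rather than from any particular eval framework.

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class HarnessSpec:
    name: str
    task_file: str               # path to the frozen question set
    task_sha256: str             # hash of that file, recorded at freeze time
    metric: str                  # e.g. "exact_match"
    acceptance_threshold: float  # what counts as a pass for a model/prompt pair

def sha256_of(path: str) -> str:
    """Hash the task file so drift in the dataset is detectable."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check_frozen(spec: HarnessSpec) -> None:
    """Refuse to run if the task set no longer matches the frozen spec."""
    actual = sha256_of(spec.task_file)
    if actual != spec.task_sha256:
        raise RuntimeError(f"{spec.name}: task set changed ({actual} != {spec.task_sha256})")
```

With the tasks, metric, and threshold pinned like this, swapping in a newer model re-runs against a byte-identical question set, so the reported deltas are comparable across releases.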
Fair point on the date - the paper was updated October 2024 with Llama-3 and Qwen2.5 (up to 72B), same findings. The v1 to v3 revision is interesting. They initially found personas helped, then reversed their conclusion after expanding to more models.
"Comprehensively disproven" was too strong - should have said "evidence suggests the effect is largely random." There's also Gupta et al. 2024 (arxiv.org/abs/2408.08631) with similar findings if you want more data points.
A paper’s date does not invalidate its method. Findings stay useful only when you can re-run the same protocol on newer models and report deltas. Treat conclusions as conditional on the frozen tasks, criteria, and measurement, then update with replication, not rhetoric.