
This has been pretty comprehensively disproven:

https://arxiv.org/abs/2311.10054

Key findings:

- Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions

- Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added

- Automatically identifying the best persona is challenging, with predictions often performing no better than random selection

- While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random

Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.





A persona is not the same thing as a role. The point of a role is to limit the work of the agent and to focus it on one or two behaviors.

What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.

The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the output of the LLM.
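To make the distinction concrete, here is a rough sketch (the complete() helper and both prompts are made up for illustration, they are not from the paper):

  # Hypothetical helper: send a system + user prompt to whatever chat API you
  # use and return the reply text. Swap in a real call here.
  def complete(system_prompt: str, user_prompt: str) -> str:
      return f"[stub reply for system={system_prompt!r}]"

  question = "Review this nginx config for security issues: ..."

  # Persona in the paper's sense: a generic identity, no task scoping.
  persona_reply = complete("You are a helpful assistant.", question)

  # Role in the sense above: scopes the agent to one or two behaviors.
  role_reply = complete(
      "You are a security engineer. Only report misconfigurations and "
      "vulnerabilities; do not suggest stylistic or performance changes.",
      question,
  )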


Aside from what you said about applicability, the paper actually contradicts their claim!

In the domain alignment section:

> The coefficient for “in-domain” is 0.004(p < 0.01), suggesting that in-domain roles generally lead to better performance than out-domain roles.

Although the effect size is small, why would you not take advantage of it?


I would be interested in an eval that checked both conditions: "you are an amazing X" vs. "you are a terrible X". Also, there have been a bunch of papers recently looking at whether threatening the LLM improves output; I would like to see a variation that tries carrot and stick as well.
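Roughly something like this, where the prompt templates and the ask_model / is_correct hooks are placeholders rather than anything from an existing eval:

  # Sketch of a carrot/stick eval (plus control) over a fixed question set.
  # ask_model() and is_correct() are hypothetical hooks you would supply.
  CONDITIONS = {
      "control": None,
      "carrot": "You are an amazing {role}. Great answers will be rewarded.",
      "stick": "You are a terrible {role}. Wrong answers have consequences.",
  }

  def run_eval(questions, role, ask_model, is_correct):
      scores = {}
      for name, template in CONDITIONS.items():
          system = template.format(role=role) if template else None
          correct = sum(
              is_correct(q, ask_model(system, q["prompt"])) for q in questions
          )
          scores[name] = correct / len(questions)
      return scores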

How well does such LLM research hold up as new models are released?

Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
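Concretely, freezing the harness can be as simple as versioning the task file plus the grading rule, and only swapping the model hook between releases (the file name and the run_model signature here are illustrative, not any real framework):

  import hashlib, json

  # Freeze the task set and grading criteria once; fingerprint them so any
  # later run can prove it used the same artefact.
  def load_frozen_tasks(path="tasks_v1.jsonl"):
      with open(path, "rb") as f:
          raw = f.read()
      fingerprint = hashlib.sha256(raw).hexdigest()
      tasks = [json.loads(line) for line in raw.splitlines() if line.strip()]
      return tasks, fingerprint

  # Swap models freely; only this hook changes between releases.
  def evaluate(run_model, tasks):
      correct = sum(
          1 for t in tasks if run_model(t["prompt"]).strip() == t["expected"]
      )
      return correct / len(tasks)

  # Report scores alongside the fingerprint so results from different model
  # releases stay comparable:
  #   tasks, fp = load_frozen_tasks()
  #   print(fp, evaluate(my_model_fn, tasks))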

In a discussion about LLMs you link to a paper from 2023, when not even GPT-4 was available?

And then you say:

> comprehensively disproven

? I don't think you understand the scientific method


Fair point on the date - the paper was updated October 2024 with Llama-3 and Qwen2.5 (up to 72B), same findings. The v1 to v3 revision is interesting. They initially found personas helped, then reversed their conclusion after expanding to more models.

"Comprehensively disproven" was too strong - should have said "evidence suggests the effect is largely random." There's also Gupta et al. 2024 (arxiv.org/abs/2408.08631) with similar findings if you want more data points.


A paper’s date does not invalidate its method. Findings stay useful only when you can re-run the same protocol on newer models and report deltas. Treat conclusions as conditional on the frozen tasks, criteria, and measurement, then update with replication, not rhetoric.

...or even how fast technology is evolving in this field.

One study has “comprehensively disproven” something for you? You must be getting misled left, right, and centre if that’s how you absorb study results.


