Destructotor's comments

Destructotor · 2026-05-08T15:54:53

I'm not sure the causes were really similar. The language switching was caused by malformed supervised training data, where the prompt was translated but the answer was kept in the original language. The goblins were due to a biased RL reward model.

Destructotor · 2026-05-08T15:37:59

> I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it.

Yeah, I thought this section in the appendix was particularly interesting:

> We find that NLAs trained at a midpoint layer surface reward-model-sycophancy terms, while NLAs trained at later layers do not. This is consistent with Lindsey et al. [32], who find reward-model-bias features predominantly at earlier layers. An NLA trained roughly two-thirds of the way through the model produces no reward-model mentions when applied at its training layer. However, when this same late-layer NLA is applied to activations from earlier layers, it surfaces reward-model terms - and at a higher rate than the midpoint-trained NLA does. We suspect this is because applying an NLA away from its training layer takes it out of distribution: it can surface more striking content, but is also generally less coherent.

They also mention training NLAs to accept multiple layers of activations as a possible future research direction.