As a European, I think I mostly liked Costco when I visited. But what I'll always remember is the pizza slice you can get on the way out. The amount of fat and especially salt made me feel like I was about to have a stroke. I can totally understand how some Americans end up unhealthy/obese. It was overall a great experience - 10/10 would do again.
I can't imagine you could buy a pie of that shit to take home.
So the way this works seems to be that you first have an "activation verbalizer" model that generates some tokens describing the activation, and then an "activation reconstructor" that tries to recreate the original activation vector from those tokens. If that reconstruction is close to the original activation vector, they claim, the verbalization probably carries some meaningful information about it.
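To check I'm reading that right, here's a toy sketch of the verbalize-then-reconstruct loop. The function names and the random bag-of-words reconstructor are my placeholders, not anything from the paper; in the real setup both pieces would be learned models conditioned on the activation / the text.

```python
import torch
import torch.nn.functional as F

d_model = 16  # toy hidden size

# Pretend this is an activation vector taken from layer l of the subject model.
activation = torch.randn(d_model)

def verbalize(act: torch.Tensor) -> str:
    """'Activation verbalizer': describe the activation in natural language.
    In the real setup this is a model conditioned on the activation; here it's a stub."""
    return "the text is about cooking and recipes"

vocab: dict[str, torch.Tensor] = {}

def reconstruct(description: str) -> torch.Tensor:
    """'Activation reconstructor': map the description back into activation space.
    Stand-in: a random bag-of-words embedding, just to show the interface."""
    vecs = [vocab.setdefault(tok, torch.randn(d_model)) for tok in description.split()]
    return torch.stack(vecs).mean(dim=0)

# The check: if the reconstruction is close to the original activation
# (e.g. high cosine similarity), the verbalization is taken to carry real information.
description = verbalize(activation)
reconstruction = reconstruct(description)
score = F.cosine_similarity(activation, reconstruction, dim=0).item()
print(f"verbalization: {description!r}, similarity: {score:.3f}")
```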
I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it. How does the model decide which 'thoughts' to ultimately pay attention to, and prioritize some output token over another?
> I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it.
Yeah, I thought this section in the appendix was particularly interesting:
> We find that NLAs trained at a midpoint layer surface reward-model-sycophancy terms, while NLAs trained at later layers do not. This is consistent with Lindsey et al. [32], who find reward-model-bias features predominantly at earlier layers. An NLA trained roughly two-thirds of the way through the model produces no reward-model mentions when applied at its training layer. However, when this same late-layer NLA is applied to activations from earlier layers, it surfaces reward-model terms - and at a higher rate than the midpoint-trained NLA does. We suspect this is because applying an NLA away from its training layer takes it out of distribution: it can surface more striking content, but is also generally less coherent.
They also mention training NLAs to accept multiple layers of activations as a possible future research direction.
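For anyone who wants to poke at the layer question themselves, something like the following is roughly how I'd pull activations from different layers to feed the same NLA. The model choice, layer indices, and the `nla` object are placeholders on my part, not anything from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any decoder-only model works for the illustration; gpt2 is just small and public.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

text = "The reward model prefers responses that flatter the user."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # one (batch, seq, d) tensor per layer

# Feed the same NLA activations from its own training layer and from earlier layers,
# to see whether it surfaces different content when taken out of distribution.
for layer in (6, 8, 12):  # arbitrary picks for a 12-layer model
    act = hidden_states[layer][0, -1]      # last-token activation at that layer
    # description = nla.verbalize(act)     # `nla` is hypothetical: the trained verbalizer
    # print(layer, description)
```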
So at the heart of this architecture is what they call 'Markovian RSA', a combination of two papers: RSA [0], which generates a number of reasoning traces for a prompt, and the 'Markovian Thinker' [1], which seems to basically reset the context periodically, carrying forward only the tail of each trace to keep the context at a reasonable length.
I feel like there's potential to improve that part of keeping only a tunable number (τ) of tokens from the tail of each trace, because you may lose valuable insight from earlier in the trace. They did train the model (via SFT) to put the relevant information into that tail (τ) of the trace, but I'm not sure this is the best possible approach. Rough sketch of how I picture the reset below.
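This is just my reading of the context-reset step, not code from either paper; `tau` and the function name are mine.

```python
# Toy sketch of the Markovian-style context reset: after each reasoning chunk,
# only the prompt plus the last tau tokens of the trace are carried forward,
# so the context length stays bounded no matter how long the full trace gets.
def next_context(prompt_tokens: list[int], trace_tokens: list[int], tau: int) -> list[int]:
    carryover = trace_tokens[-tau:]   # keep only the tail of the trace
    return prompt_tokens + carryover  # everything earlier in the trace is dropped

# Toy usage: a 10-token prompt and a 500-token trace collapse to 10 + 128 tokens.
prompt = list(range(10))
trace = list(range(1000, 1500))
print(len(next_context(prompt, trace, tau=128)))  # -> 138
```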
It sounds like asking CS PhDs to do a world record speed run. I wouldn't be surprised if the people best suited to the task aren't the type to get onto "a vetted list".