So it is my contention that LLMs exhibit behavior far beyond what we could reasonably predict from a next-token-prediction task on their training set. Therefore I don't really like the framing of "this is present in the training data" as a response to LLM capability, except in a very narrow sense.
One issue is that we anthropomorphise -- we see training data that, to a human, looks similar to the task at hand, and therefore we say that this task is represented in the training data, despite the fact that in the next-token-prediction sense that reflection does not exist (unless your model of next-token prediction is as complex as the LLM itself).
My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.
> My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.
I guess it depends on what you mean by "reflecting" the training data. Obviously the apparent knowledge/understanding of the model has come from the training data (nowhere else for it to come from), so the question is really how best to understand that. Next-token prediction is what the model does, but it says nothing about how it does it, and so is not very helpful in setting expectations for what the model will be capable of.
When you look at the transformer model in detail, there are two aspects that really give it its power.
1) The specific form of the self-attention mechanism, whereby the model learns keys that can be used to look up associated data at arbitrary distances away (not just adjacent words as in much simpler N-gram language models).
2) The layered architecture, whereby higher levels of representation and meaning can be extracted and built on top of lower ones (with all of this being accumulated/transformed in the embeddings). This layered architecture was chosen by Jakob Uszkoreit to allow hierarchical parsing similar to that reflected in linguists' sentence parse trees.
When we then look at how trained transformers actually use this architecture (the field of mechanistic interpretability), one of the most powerful mechanisms found is the "induction head", where the self-attention mechanisms of adjacent layers have learned to co-operate to copy data (partial embeddings) from one part of the input to another.
This is "A'B' => AB" copying mechanism is very general, and is where a lot of the predictive/generative power of the trained transformer is coming from.
So, while it's true to say that an LLM (transformer) is "just" doing next-token prediction, the depth of representation and representation-transformation that it is able to bring to bear on this task (i.e. has been forced to learn in order to minimize errors) is significant, which is why some of the things it is capable of seem counter-intuitive if framed just as auto-complete or as a mashup of partial matches from the training set (which is still not a bad mental model).
The way word -> letter sequence generation seems to work, given that it works on unique made-up nonsense words and not just dictionary ones, is via (induction head) copying of token -> letter sequences. All that is needed is for the model to have learnt the individual token -> letter-sequence association for each token included in the nonsense word; it can then use the induction head mechanism, with the tokens of the nonsense word as keys, to look up these associations and copy them to the output.
e.g.
If T1-T3 are tokens, and the training set includes:
T1 T2 -> w i l d c a t, and
T1 T3 -> w i l d f i r e
Then the model (to reduce its loss when predicting these) will have learnt that T1 -> w i l d, and so when asked to convert a nonsense word containing the token T1 to letters, it can use this association to generate the letter sequence for T1, and so on for the remaining tokens of the word.
The conclusion here seems improbable at best -- if I understand it right, the assumption is that somewhere in the training data is the literal token string (wild)(cat)[other tokens](w)(i)(l)(d)(c)(a)(t)?
Even a transformer trained exclusively on examples of the form (token)(token)(letter-token)(letter-token)...(letter-token) where the letter-tokens are single letters and the tokens represent the standard tokenizer output would have trouble performing this task.
I guess this last statement is testable. I suspect that it would be unsuccessful without vast amounts of training data of this form, and I think we can probably agree that, although there may be some, there are not sufficient examples of this form in standard LLM training sets to learn this task specifically; the ability to do this (limited as it is) is an emergent capability of general-purpose LLMs.
1) Novel words are handled because they are just sequences of common tokens
2) Token -> letter sequence associations are either:
a) Deliberately added to the training set, and/or
b) Naturally occurring in the training set, which due to sheer size almost inevitably contains many, many examples of word to letter sequence associations
Given how models used to fail badly on tasks related to this, and now do much better, it's quite likely that model providers have simply added these to the training set, just as they have added data to improve other benchmark tests.
That said, what I was pointing out is that words are represented as token sequences, so a word spelling sample is effectively a seq-2-seq (tokens to letters) sample, and we'd expect the model (which is built for seq-2-seq!) to be able to easily learn and generalize over these.