
I think the answer is actually quite clear and rather boring. In order to get something "right" there has to be some external standard of knowledge and correctness. That definition of correctness can only be provided by the observer (user). Alignment between the user's correctness criteria and generated text happens entirely by accident. This can be demonstrated by observing a correlation between coverage of a domain in the training data and the rate at which incorrect results are produced (as discussed in other comments). That is, they get things "right" because there was sufficient training data that contained information that matched the user's definition for correctness. In fact, exceptionally boring.


This is a very post hoc explanation. What does "coverage in the training data" mean?

Take a simple task of something like "How many a's are there in the word bookkeeper" -- what is your theory for why it can answer this question correctly or even give something approaching a coherent answer? It never even sees the letters that are in the token "bookkeeper", and this is definitely not something that appears explicitly in the training data.

I challenge you to give a "clear and boring" explanation for this -- this is incredibly subtle behavior that emerges from a complex architecture and complex training process, and is in its own right as fascinating and mysterious as the ability of humans to do this task and the inability of cats to do it.


Try asking Claude 3.5 Sonnet (one of today's best models):

"how many p's in Lypophrenia - just a number please"

I tried it a second ago, and it said "1".

To get these correct requires splitting tokens into letters and counting. I'd not be surprised if most models are either trained on token splitting or have learnt to do it. "Counting" number of occurrences of letters in an arbitrary separated sequence is the harder part, and where I'd guess it might be failing.
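A quick sanity check of the ground truth, plus a toy illustration of why token-level input makes this hard. The token split shown is hypothetical; real tokenizers will differ:

```python
# Ground truth: counting letters is trivial at the character level.
word = "Lypophrenia"
print(word.lower().count("p"))  # 2

# But an LLM never sees characters -- it sees token IDs.
# Hypothetical BPE-style split (a real tokenizer may split differently):
tokens = ["Ly", "pop", "hren", "ia"]
# To count p's, the model must first have learned some token -> letters
# association for each token, and only then count over the recovered
# letter sequence -- two separate things that can each go wrong.
```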


Yes! They are commonly wrong on this, and that's fascinating too. They are not solving the problem by looking at the letters in the word, because they are not architecturally capable of enumerating the letters in the word. The fact that they can do it at all could be the stuff of an entire PhD thesis, and could tell us more about the nature of LLM hallucination than a bunch of rambling about "how much coverage in the training set" when our determination of the coverage is based on human semantic similarity.


The ability to split words into letters isn't architecturally limited - it's just a matter of training data, and it's made easier by the fact that the input is tokens representing short letter sequences rather than whole words, of which there are far more.

It's quite possible that more recent training data deliberately includes word/token -> letter sequence samples, but even if not I'd expect there is going to be enough spelling examples naturally occurring in the training data for the model to learn the token (not word) -> letter sequence rules (which will be consistent/reinforced across all spelling samples), which it can then apply to arbitrary words.


So it is my contention that LLMs exhibit behavior far beyond what we could reasonably predict from a next-token-prediction task on their training set. Therefore I don't really like the framing of "this is present in the training data" as a response to LLM capability, except in a very narrow sense.

One issue is that we anthropomorphise -- we see training data that, to a human, looks similar to the task at hand, and therefore we say that this task is represented in the training data, despite the fact that in the next-token-prediction sense that reflection does not exist (unless your model for next-token-prediction is as complex as the LLM itself).

My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.


> My question to you -- what would falsify your belief that the LLMs just reflect tasks from the training set? Or at least, what would reduce your confidence in this? The letter sequence stuff for me seems like pretty clear evidence against.

I guess it depends on what you mean by "reflecting" the training data. Obviously the apparent knowledge/understanding of the model has come from the training data (nowhere else for it to come from), so the question is really how to best understand that. Next-token prediction is what the model does, but that says nothing about how it does it, and so is not very helpful in setting expectations for what the model will be capable of.

When you look at the transformer model in detail, there are two aspects that really give it its power.

1) The specific form of the self-attention mechanism, whereby the model learns keys that can be used to look up associated data at arbitrary distances away (not just adjacent words, as in much simpler N-gram language models).

2) The layered architecture, whereby levels of representation and meaning can be extracted and built upon lower levels (with this all being accumulated/transformed in the embeddings). This layered architecture was chosen by Jakob Uszkoreit to allow hierarchical parsing similar to that reflected in linguists' sentence parse trees.

When we then look at how trained transformers actually use this architecture - the field of mechanistic interpretability - one of the most powerful mechanisms found is the "induction head", where the self-attention mechanisms of adjacent layers have learned to cooperate to copy data (partial embeddings) from one part of the input to another.

https://transformer-circuits.pub/2022/in-context-learning-an...

This "A'B' => AB" copying mechanism is very general, and is where a lot of the predictive/generative power of the trained transformer comes from.

So, while it's true to say that an LLM (transformer) is "just" doing next-token prediction, the depth of representation and representation-transformation that it is able to bring to bear on this task (i.e. has been forced to learn to minimize errors) is significant, which is why some of the things it is capable of seem counter-intuitive if framed just as auto-complete or as a mashup of partial matches from the training set (which is still not a bad mental model).
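The "A'B' => AB" pattern from the linked paper can be sketched as a hand-coded stand-in for what the attention heads learn (this is an illustration of the pattern, not a model): to predict the next token, find the most recent earlier occurrence of the current token (A) and copy the token that followed it (B).

```python
# Minimal sketch of induction-head style copying: look back for the
# last earlier occurrence of the current token and emit whatever
# followed it then.
def induction_predict(context):
    current = context[-1]
    # scan backwards over earlier positions
    for i in range(len(context) - 2, -1, -1):
        if context[i] == current:
            return context[i + 1]  # copy the token that followed last time
    return None  # no earlier occurrence to copy from

seq = ["the", "cat", "sat", "on", "the"]
print(induction_predict(seq))  # "cat"
```

In a real transformer this behavior is implemented by pairs of attention heads in adjacent layers rather than an explicit scan, but the input/output behavior is the same prefix-matching-and-copying.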

The way word -> letter sequence generation seems to be working, given that it works on unique made-up nonsense words and not just dictionary ones, is via (induction head) copying of token -> letter sequences. All that is needed is for the model to have learnt the individual token -> sequence associations of each token included in the nonsense word, and it can then use the induction head mechanism to use the tokens of the nonsense word as keys to lookup these associations and copy them to the output.

e.g.

If T1-T3 are tokens, and the training set includes:

T1 T2 -> w i l d c a t, and T1 T3 -> w i l d f i r e

Then the model (to reduce its loss when predicting these) will have learnt that T1 -> w i l d, and so when asked to convert a nonsense word containing the token T1 to letters, it can use this association to generate the letter sequence for T1, and so on for the remaining tokens of the word.


The conclusion here seems improbable at best -- if I understand it right, the assumption is that somewhere in the training data is the literal token string (wild)(cat)[other tokens](w)(i)(l)(d)(c)(a)(t)?

Even a transformer trained exclusively on examples of the form (token)(token)(letter-token)(letter-token)...(letter-token) where the letter-tokens are single letters and the tokens represent the standard tokenizer output would have trouble performing this task.

I guess this last statement is testable. I suspect that it would be unsuccessful without vast amounts of training data of this form, and I think we can probably agree that although there may be some, there are not sufficient examples of this form in standard LLM training sets to be able to learn this task specifically; the ability to do this (limited as it is) is an emergent capability of general-purpose LLMs.


What I'm saying is that:

1) Novel words are handled because they are just sequences of common tokens

2) Token -> letter sequence associations are either:

a) Deliberately added to the training set, and/or

b) Naturally occurring in the training set, which due to sheer size almost inevitably contains many, many, examples of word to letter sequence associations

Given how models used to fail badly on tasks related to this, and now do much better, it's quite likely that model providers have simply added these to the training set, just as they have added data to improve other benchmark tests.

That said, what I was pointing out is that words are represented as token sequences, so a word spelling sample is effectively a seq-2-seq (tokens to letters) sample, and we'd expect the model (which is built for seq-2-seq!) to be able to easily learn and generalize over these.


Are you surprised that jpg compression algorithms can reproduce input data that bears striking resemblance to the uncompressed input image across a variety of compression levels?



