I recall reading recently that someone went back and trained an RNN at a similar scale to a GPT and got similar performance on modern hardware (perhaps someone can link me that paper?).
ie., the innovation in statistical AI isn't in making the algorithms "smarter", it's finding ways to align the computation with modern GPU hardware -- this has been the story since 2012.
In the end, the function all such algs are approximating is a conditional probability. ie., the perfect answer to any prompt is to ignore training entirely, and at inference time, compute an expectation across all historical data. All training does is essentially optimally cache a large part of that computation.
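A toy sketch of that claim (corpus and bigram "model" entirely illustrative): next-token prediction reduces to a conditional frequency over the historical data, computed here directly at inference time — the very computation that training would otherwise pre-compute and cache:

```python
from collections import Counter, defaultdict

# Illustrative toy corpus; a bigram "model" whose predictions are raw
# conditional frequencies computed at inference time from the data.
corpus = "the cat sat on the mat the cat ate".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(word):
    """Conditional probability P(next | word), straight from the counts."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(p_next("the"))  # {'cat': 2/3, 'mat': 1/3}
```

Training a real model amounts to amortising this expectation over the whole corpus up front, so that inference need only sample from the cached result.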
This is very different to how it's typically sold/understood, in the sense that there's an appearance that at inference-time some unbounded computation is going on, ie., "thinking"/"reasoning"/etc. But at inference time for any prompt the same amount of computation is used, regardless of the question complexity. So the system will appear to reason (etc.) if it can sample convincingly from its pre-cached computation.
This means "innovation" here follows a Moore's-law S-curve for GPU hardware.
> But at inference time for any prompt the same amount of computation is used, regardless of the question complexity.
That's not true, and that's why "think step by step" works. The output becomes part of the context. Forcing that way of answering essentially merges two queries into one: split the question into subtasks, then solve each subtask separately.
The more complex question does cause more computation and yields better-quality answers.
Sure, it's not quite true, but it's close enough for the analysis to be correct.
ie., if we take a family of problems P1..Pn, each statable as a prompt, and each having an 'essential complexity to solve' O(P1)...O(Pn), then it's trivial to show that no curve-fitting statistical algorithm assigns the 'correct' computational budget to each. The budget is basically constant, and does not increase correctly as the problem complexity increases.
(Eg., consider the prompts: by applying Dijkstra's algorithm, find a path ... on graph G1, G2, ..., Gn -- with ever more complex graphs).
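For concreteness, a minimal sketch of such a family (the random graphs and sizes are illustrative): the essential work Dijkstra's algorithm does grows roughly as O((V + E) log V) with the instance, whereas an LLM's per-token cost stays flat no matter which Gi appears in the prompt:

```python
import heapq
import random
import time

def dijkstra(adj, src):
    """Shortest distances from src; adj maps node -> [(neighbour, weight)]."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def random_graph(n, avg_deg=4, seed=0):
    """Random directed graph on n nodes with ~avg_deg edges per node."""
    rng = random.Random(seed)
    return {u: [(rng.randrange(n), rng.randint(1, 10)) for _ in range(avg_deg)]
            for u in range(n)}

# The essential work grows with the instance (G1, G2, ...), unlike a
# fixed per-token budget.
for n in (100, 1_000, 10_000):
    g = random_graph(n)
    t0 = time.perf_counter()
    dijkstra(g, 0)
    print(f"n={n:>6}: {time.perf_counter() - t0:.4f}s")
```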
It is right to say that step-by-step prompting expands the computational resources available to the LLM and therefore often improves the quality of the answer... but it is incorrect to say that it does so by allocating computation according to the problem's complexity. At best you could say that it is we, the human prompters, who, seeing that the answer is bad, increase its budget -- so we are partially modelling the complexity of the problem for the LLM.
If you could create an LLM that "prompted itself" according to the complexity of the problem, and measured its energy/time/etc. to scale as we would expect... then you'd be on firmer ground with this claim. But I'd be inclined to deny this is possible, given what an LLM is...
ie., I claim that I can always find a family of problems P1..Pn where the computational distribution will rule out that reasoning is taking place. Why? Because the LLM is only sampling from a compression of the training data; I deny it is at all sensitive to the meaning of the terms.
So whereas a person will understand a novel problem (family) on the basis of its meaning, and expend time appropriately, the LLM is limited to sampling from prior examples of such problems in training data.
> ie., I claim that I can always find a family of problems P1..Pn where the computational distribution will rule out that reasoning is taking place.
Finding edge cases is not a great way to evaluate probabilistic processes. You'll almost always find a pathological case. I'm sure we can find a series of questions of increasing difficulty where humans think they're of decreasing difficulty. That wouldn't prove humans don't reason.
> If you could create an LLM that "prompted itself" according to the complexity of the problem
You don't have to create one. We know that in-context learning is roughly equivalent to training, so you can add to the system prompt instructions for evaluating task difficulty and taking different approaches depending on the result. You'll see that such an estimate is possible. Agents use this all the time to decide when to split a task into subtasks and when to proceed with solving it.
> So whereas a person will understand a novel problem (family) on the basis of its meaning, and expend time appropriately, the LLM is limited to sampling from prior examples of such problems in training data.
Can you prove humans aren't just sampling from training data + preserving context? I don't believe anyone has really documented how human thought works at that level. I'm not saying your claim is true or false, just: do we actually know enough to validate it?
> Finding edge cases is not a great way to evaluate probabilistic processes.
These aren't "edge cases". They are proofs by contradiction:
Claim 1: LLMs have a fixed computation budget per token
Claim 2: LLMs engage in reasoning
Claim 3: Reasoning is an unbounded process whose time is a function of the complexity of the problem
Claim 4: Therefore LLMs compute in proportion to the complexity of the problem (from Claims 2 and 3)
Problem: Contradiction in Claims 1, 4.
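One way to make the contradiction precise (a sketch in Lean 4, with the claims abstracted as assumptions about a compute-cost function):

```lean
-- Abstract the debate: `cost n` is the compute spent on a problem of
-- complexity n. Claim 1 says it is constant; Claim 4 (from Claims 2+3)
-- says it grows without bound. Together they are inconsistent.
example (cost : Nat → Nat) (c : Nat)
    (h1 : ∀ n, cost n = c)          -- Claim 1: fixed budget
    (h4 : ∀ b, ∃ n, cost n > b) :   -- Claim 4: unbounded growth
    False := by
  cases h4 c with
  | intro n hn =>
    rw [h1 n] at hn
    exact Nat.lt_irrefl c hn
```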
There are an infinite number of "edge" cases to illustrate this point; but the point has nothing to do with LLMs failing on some algorithms. It's not an argument about LLM accuracy.
> Can you prove humans aren't just sampling from training data + preserving context
Yes. Above, Claim 1': Humans do not have a fixed computation budget per token. So there's no contradiction.
It will take you an appropriately long time to say "Yes" or "No" as the task complexity increases. Hence you are actually engaged in reasoning.
I don't find your style of argument good-faith, though; it is wish fulfilment and argument from ignorance. "I do not know anything about human beings, and I will tell you that you do not either, so that I can preserve what I wish to be true" -- this is the position of almost all defenders of AI I've encountered, and I find it disagreeably religious.
This falls apart at claim 1, because it's for a token, not the answer. You can start a reasoning process, but I don't know of anyone seriously claiming it happens in a single step.
LLMs have a fixed budget per token, but the result you normally want is many tokens long. A chain-of-thought answer means a variable number of tokens, with the length depending on the complexity of the task. Earlier in this thread I mentioned the prompt + generated subtasks, so it's not as if we're talking about single-token outputs.
Or, in a simple example, the answer to "using step-by-step reasoning, what's the colour of a white sheet?" is significantly shorter than the answer to "using step-by-step reasoning, why is the Chewbacca defence a fallacy?". The more complex task automatically got more steps and more tokens dedicated to it.
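That point can be put arithmetically (all figures below are made up for illustration; the ~2 FLOPs per parameter per generated token is a common rule of thumb for dense models): per-token cost is fixed, so total inference compute varies only through the number of tokens emitted, which chain-of-thought prompting inflates for harder questions:

```python
# Illustrative figures only. Rule of thumb: a dense model spends
# roughly 2 FLOPs per parameter per generated token.
PARAMS = 7_000_000_000          # hypothetical 7B-parameter model
FLOPS_PER_TOKEN = 2 * PARAMS    # fixed, whatever the prompt means

def total_flops(n_output_tokens: int) -> int:
    # The only free variable is the token count -- problem difficulty
    # enters solely through how long the sampled answer happens to be.
    return FLOPS_PER_TOKEN * n_output_tokens

white_sheet = total_flops(20)   # short step-by-step answer
chewbacca = total_flops(400)    # longer chain of thought
assert chewbacca == 20 * white_sheet
```

On this sketch, both sides of the thread are represented: total compute does grow with answer length, but only as (constant per-token cost) × (token count), never as a function of the problem's intrinsic complexity.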
> Humans do not have a fixed computation budget per token.
That's not even wrong. Humans don't operate on tokens. That just doesn't make sense.
At least two strategies of reply: 1) choose problems P1..n such that they have yes/no answers; 2) rephrase my argument to consider max-tokens given any prompt.
But I think you could see these yourself, so I'm at a loss as to what the purpose of this game is.
What is the motivation to insist that animal cognition consists of some process similar to compressing 1PB of books into frequency associations and computing a conditional probability?
Is it some desire to believe one hasn't been duped by the technology? "No no, I am no fool, it really is real!" etc.? Or wish fulfilment of some other sort? Or just the cynical view that if one is nihilistic about people, one is most likely right? (Each, of course, a pseudoscience of its own sort.)
I disagree. Ring attention and tree attention are so general that the core ideas are independent of the details of modern GPUs. Maybe that's true of flash attention, but not these. I also disagree because these algorithms are fundamentally about enabling long context by distributing across GPUs, and that would not be enabled by "Moore's law for GPU hardware".
If less training performs better, that's overfitting. In the case of NLP/CV you, in general, cannot really "overfit", because most of the goal isn't predictive but representational -- so training = templating, rather than predicting. In these cases training for longer should mostly improve the quality of the output, since there isn't any ground truth to overfit against.