I recall reading recently that someone went back and trained an RNN at a similar scale to a GPT and got similar performance on modern hardware (perhaps someone can link me that paper?).
ie., the innovation in statistical AI isn't in making the algorithms "smarter", it's finding ways to align the computation with modern GPU hardware -- this has been the story since 2012.
In the end, the function all such algs are approximating is a conditional probability. ie., the perfect answer to any prompt is to ignore training entirely, and at inference time, compute an expectation across all historical data. All training does is essentially optimally cache a large part of that computation.
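A toy sketch of that claim (corpus and bigram "model" entirely illustrative): next-token prediction reduces to a conditional frequency over the historical data, computed here directly at inference time — the very computation that training would otherwise pre-compute and cache:

```python
from collections import Counter, defaultdict

# Illustrative toy corpus; a bigram "model" whose predictions are raw
# conditional frequencies computed at inference time from the data.
corpus = "the cat sat on the mat the cat ate".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(word):
    """Conditional probability P(next | word), straight from the counts."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(p_next("the"))  # {'cat': 2/3, 'mat': 1/3}
```

Training a real model amounts to amortising this expectation over the whole corpus up front, so that inference need only sample from the cached result.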
This is very different to how it's typically sold/understood, in the sense that there's an appearance that at inference-time some unbounded computation is going on, ie., "thinking"/"reasoning"/etc. But at inference time for any prompt the same amount of computation is used, regardless of the question complexity. So the system will appear to reason (etc.) if it can sample convincingly from its pre-cached computation.
This means "innovation" here follows a Moore's-law S-curve for GPU hardware.
> But at inference time for any prompt the same amount of computation is used, regardless of the question complexity.
That's not true, and that's why "think step by step" works. The output becomes part of the context. Forcing that way of answering essentially merges two queries into one: split the question into subtasks, then solve each subtask separately.
The more complex question does cause more computation and yields better-quality answers.
Sure, it's not quite true, but it's close enough for the analysis to be correct.
ie., if we take a family of problems P1..Pn, each statable as a prompt, and each having an 'essential complexity to solve' O(P1)...O(Pn), then it's trivial to show that no curve-fitting statistical algorithm assigns the 'correct' computational budget to each. The budget is basically constant, and does not increase correctly as the problem complexity increases.
(Eg., consider the prompts: by applying Dijkstra's algorithm, find a path ... on graph G1, G2, ..., Gn -- with ever more complex graphs).
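For concreteness, a minimal sketch of such a family (the random graphs and sizes are illustrative): the essential work Dijkstra's algorithm does grows roughly as O((V + E) log V) with the instance, whereas an LLM's per-token cost stays flat no matter which Gi appears in the prompt:

```python
import heapq
import random
import time

def dijkstra(adj, src):
    """Shortest distances from src; adj maps node -> [(neighbour, weight)]."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def random_graph(n, avg_deg=4, seed=0):
    """Random directed graph on n nodes with ~avg_deg edges per node."""
    rng = random.Random(seed)
    return {u: [(rng.randrange(n), rng.randint(1, 10)) for _ in range(avg_deg)]
            for u in range(n)}

# The essential work grows with the instance (G1, G2, ...), unlike a
# fixed per-token budget.
for n in (100, 1_000, 10_000):
    g = random_graph(n)
    t0 = time.perf_counter()
    dijkstra(g, 0)
    print(f"n={n:>6}: {time.perf_counter() - t0:.4f}s")
```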
It is right to say that step-by-step prompting expands the computational resources available to the LLM and therefore often improves the quality of the answer... but it is incorrect to say that it does so by allocating computation according to the problem's complexity. At best you could say that it is we, the human prompters, who, seeing that the answer is bad, increase its budget -- so we are partially modelling the complexity of the problem for the LLM.
If you could create an LLM that "prompted itself" according to the complexity of the problem, and measured its energy/time/etc. to scale as we would expect... then you'd be on firmer ground with this claim. But I'd be inclined to deny this is possible, given what an LLM is...
ie., I claim that I can always find a family of problems P1..Pn where the computational distribution will rule out that reasoning is taking place. Why? Because the LLM is only sampling from a compression of the training data; I deny it is at all sensitive to the meaning of the terms.
So whereas a person will understand a novel problem (family) on the basis of its meaning, and expend time appropriately, the LLM is limited to sampling from prior examples of such problems in training data.
> ie., I claim that I can always find a family of problems P1..Pn where the computational distribution will rule out that reasoning is taking place.
Finding edge cases is not a great way to evaluate probabilistic processes. You'll almost always find a pathological case. I'm sure we can find a series of questions of increasing difficulty where humans think they're of decreasing difficulty. That wouldn't prove humans don't reason.
> If you could create an LLM that "prompted itself" according to the complexity of the problem
You don't have to create one. We know that in-context learning is roughly equivalent to training, so you can add to the system prompt instructions for evaluating task difficulty and taking different approaches depending on the result. You'll see that such an estimate is possible. Agents use this all the time to decide when to split a task into subtasks and when to proceed with solving it.
> So whereas a person will understand a novel problem (family) on the basis of its meaning, and expend time appropriately, the LLM is limited to sampling from prior examples of such problems in training data.
Can you prove humans aren't just sampling from training data + preserving context? I don't believe anyone has really documented how human thought works at that level. I'm not saying your claim is true or false, just: do we actually know enough to validate it?
> Finding edge cases is not a great way to evaluate probabilistic processes.
These aren't "edge cases". They are proofs by contradiction:
Claim 1: LLMs have a fixed computation budget per token
Claim 2: LLMs engage in reasoning
Claim 3: Reasoning is an unbounded process whose time is a function of the complexity of the problem
Claim 4: Therefore LLMs compute in proportion to the complexity of the problem (from Claims 2 and 3)
Problem: Contradiction in Claims 1, 4.
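One way to make the contradiction precise (a sketch in Lean 4, with the claims abstracted as assumptions about a compute-cost function):

```lean
-- Abstract the debate: `cost n` is the compute spent on a problem of
-- complexity n. Claim 1 says it is constant; Claim 4 (from Claims 2+3)
-- says it grows without bound. Together they are inconsistent.
example (cost : Nat → Nat) (c : Nat)
    (h1 : ∀ n, cost n = c)          -- Claim 1: fixed budget
    (h4 : ∀ b, ∃ n, cost n > b) :   -- Claim 4: unbounded growth
    False := by
  cases h4 c with
  | intro n hn =>
    rw [h1 n] at hn
    exact Nat.lt_irrefl c hn
```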
There are an infinite number of "edge" cases to illustrate this point; but the point has nothing to do with LLMs failing on some algorithms. It's not an argument about LLM accuracy.
> Can you prove humans aren't just sampling from training data + preserving context
Yes. Above, Claim 1': Humans do not have a fixed computation budget per token. So there's no contradiction.
It will take you an appropriately long time to say "Yes" or "No" as the task complexity increases. Hence you are actually engaged in reasoning.
I don't find your style of argument good-faith, though; it is wish fulfilment and argument from ignorance. "I do not know anything about human beings, and I will tell you that you do not either, so that I can preserve what I wish to be true" -- this is the position of almost all defenders of AI I've encountered, and I find it disagreeably religious.
This falls apart at claim 1, because it's for a token, not the answer. You can start a reasoning process, but I don't know of anyone seriously claiming it happens in a single step.
LLMs have a fixed budget per token, but the result you normally want is many tokens long. A chain-of-thought answer means a variable number of tokens, with the length depending on the complexity of the task. Earlier in this thread I mentioned the prompt + generated subtasks, so it's not as if we're talking about single-token outputs.
Or, in a simple example, the answer to "using step-by-step reasoning, what's the colour of a white sheet?" is significantly shorter than the answer to "using step-by-step reasoning, why is the Chewbacca defence a fallacy?". The more complex task automatically got more steps and more tokens dedicated to it.
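That point can be put arithmetically (all figures below are made up for illustration; the ~2 FLOPs per parameter per generated token is a common rule of thumb for dense models): per-token cost is fixed, so total inference compute varies only through the number of tokens emitted, which chain-of-thought prompting inflates for harder questions:

```python
# Illustrative figures only. Rule of thumb: a dense model spends
# roughly 2 FLOPs per parameter per generated token.
PARAMS = 7_000_000_000          # hypothetical 7B-parameter model
FLOPS_PER_TOKEN = 2 * PARAMS    # fixed, whatever the prompt means

def total_flops(n_output_tokens: int) -> int:
    # The only free variable is the token count -- problem difficulty
    # enters solely through how long the sampled answer happens to be.
    return FLOPS_PER_TOKEN * n_output_tokens

white_sheet = total_flops(20)   # short step-by-step answer
chewbacca = total_flops(400)    # longer chain of thought
assert chewbacca == 20 * white_sheet
```

On this sketch, both sides of the thread are represented: total compute does grow with answer length, but only as (constant per-token cost) × (token count), never as a function of the problem's intrinsic complexity.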
> Humans do not have a fixed computation budget per token.
That's not even wrong. Humans don't operate on tokens. That just doesn't make sense.
At least two strategies of reply: 1) choose problems P1..n such that they have yes/no answers; 2) rephrase my argument to consider max-tokens given any prompt.
But I think you could see these yourself, so I'm at a loss as to what the purpose of this game is.
What is the motivation to insist that animal cognition consists of some process similar to compressing 1PB of books into frequency associations and computing a conditional probability?
Is it some desire to believe one hasn't been duped by the technology? "No no, I am no fool, it really is real!" etc.? Or wish fulfilment of some other sort? Or just the cynical view that if one is nihilistic about people, one is most likely right? (Each, of course, a pseudoscience of its own sort.)
I disagree. Ring attention and tree attention are so general that the core ideas are independent of the details of modern GPUs. Maybe that's true of flash attention, but not these. I also disagree because these algorithms are fundamentally about enabling long context by distributing across GPUs, and that would not be enabled by "Moore's law for GPU hardware".
If less training performs better, that's overfitting. In the case of NLP/CV you, in general, cannot really "overfit", because most of the goal isn't predictive but representational -- so training = templating, rather than predicting. In these cases training for longer should mostly improve the quality of the output, since there isn't any ground truth to overfit against.