My math is rusty, but it looks to have higher complexity than the original attention; I can't say whether that's an issue. Generally we seem willing to spend more computation at training time if it produces better results at inference time. In this case they're reducing the resources needed at inference time (an order of magnitude for the KV cache), or enabling longer sequences given the same resources.
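To make the "order of magnitude for the KV cache" claim concrete, here's a rough back-of-envelope with made-up dimensions, assuming the savings come from caching small per-token factors instead of the full K and V heads (the rank and sizes are my own, not the paper's):

```python
# Toy KV-cache comparison: full K/V heads vs. cached low-rank factors.
n_layers, n_heads, d_head, seq_len = 32, 32, 128, 8192
rank = 2  # hypothetical factor rank

# Standard cache: K and V, per layer, per head, per position.
full_cache = 2 * n_layers * seq_len * n_heads * d_head

# Factored cache (assumed): per layer/position, store one head factor
# (rank * n_heads) and one dim factor (rank * d_head) for each of K and V.
factored_cache = 2 * n_layers * seq_len * rank * (n_heads + d_head)

print(full_cache / factored_cache)  # 12.8 -> roughly an order of magnitude, in this toy setup
```

Obviously the real ratio depends on the rank and head sizes they actually use; this is just to show where that kind of saving could come from.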
There's another paper I saw yesterday, "Element-wise Attention is All You Need", which looks like an early preprint: written by a solo author with a solo A800 and tested on some smaller problems. If the results hold up on language benchmarks, it could reduce resource requirements during training as well, and it looks to have lower complexity as sequence length scales.
They're not proposing to apply tensor decomposition to an existing collection of weights. It's an architecture in which the K, V, and Q tensors are constructed as a product of factors. The model works with the factors directly; you just compute their product on the forward pass (and the adjoints on the backward pass), so there's no decomposition step at all.
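For a rough idea of what "constructed as a product of factors" can look like, here's a minimal PyTorch sketch where the K tensor for each token is assembled from a head factor and a dim factor. The module name, shapes, and rank are my own illustration, not the paper's actual parameterization:

```python
import torch
import torch.nn as nn

class FactoredK(nn.Module):
    """Sketch: build per-token K heads as a sum of rank-1
    (head factor x dim factor) products, instead of one dense
    projection per head. Hypothetical module, not the paper's."""

    def __init__(self, d_model: int, n_heads: int, d_head: int, rank: int):
        super().__init__()
        self.n_heads, self.d_head, self.rank = n_heads, d_head, rank
        # The model parameterizes (and could cache) only these factor projections.
        self.head_factor = nn.Linear(d_model, rank * n_heads, bias=False)
        self.dim_factor = nn.Linear(d_model, rank * d_head, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        a = self.head_factor(x).view(b, s, self.rank, self.n_heads)
        c = self.dim_factor(x).view(b, s, self.rank, self.d_head)
        # K only exists transiently as the product of the factors on the
        # forward pass; autograd supplies the adjoints on the backward pass.
        return torch.einsum('bsrh,bsrd->bshd', a, c)  # (batch, seq, n_heads, d_head)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    layer = FactoredK(d_model=64, n_heads=4, d_head=32, rank=2)
    print(layer(x).shape)  # torch.Size([2, 16, 4, 32])
```

The point is that the full K tensor is never stored as a trained object; only the factors are learned, so there's nothing to decompose after the fact.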
Looks like it's just a matrix decomposition in the paper, though I'm guessing; these attention papers are always a painful mix of mathematical, quasi-mathematical, and information-retrieval jargon.
There is something in the GitHub repo about higher-order decompositions, but I can't find where the method for factoring is given.