Disclaimer: I'm shared first author of this paper.
As a clarification: Training speed will be on par with FlashAttention-2 once fully optimized, when including only the mLSTM. For decoding/inference, both are very close to Mamba, as xLSTM is a recurrent architecture. The sLSTM has memory mixing, that is, state-tracking capabilities, for problems that Transformers and State Space Models (and any other sequence-parallelizable architecture) fundamentally cannot solve.
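To make the decoding point concrete, here is a toy numpy sketch (illustrative only, nothing like our actual kernels; all sizes are made up): a recurrent cell carries a fixed-size state, so per-token cost and memory stay constant, while attention has to keep a KV cache that grows with every generated token.

```python
import numpy as np

d = 8  # toy model width (made up)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d)) / np.sqrt(d)

# Recurrent decoding (xLSTM-style): fixed-size state, O(1) memory per token.
def recurrent_step(state, x):
    return np.tanh(W @ state + x)  # the state never grows

# Attention decoding: the KV cache grows with every token, O(t) memory and cost.
K_cache, V_cache = [], []
def attention_step(q, k, v):
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    a = np.exp(K @ q / np.sqrt(d))
    a /= a.sum()
    return a @ V

state = np.zeros(d)
for t in range(100):
    x = rng.normal(size=d)
    state = recurrent_step(state, x)  # memory: d floats, always
    _ = attention_step(x, x, x)       # memory: 2 * (t + 1) * d floats and counting
```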
Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer arch[1][2]; will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it's hard to compete on general-purpose hardware?
So does anything do proper state tracking? And don't point to the OP, since very often purportedly better new architectures end up being basically vaporware (like Mamba or RWKV, which still don't have good-quality pre-trained models).
Surely whether a big model using a certain architecture exists is just a matter of the choices of those with sufficient resources to train one. That reflects their beliefs, not actual model performance.
Congratulations on the paper. That's some very interesting work!
But you would want to include the sLSTM as well to get the best performance, right? How does the speed compare in that case, specifically when scaling up?
Thank you! I can say that it is not really a limiting factor at the scales reported in the paper. So xLSTM[7:1] is pretty much on par with xLSTM[1:0] in speed.
We show that it is helpful on toy tasks, and it also gives better sequence-extrapolation performance, so yes.
Great work! I'd love to start using the language-model variant of your work. Do you know when/if it will be open-sourced? I'd start using it today if it were available.
> For decoding/inference, both are very close to Mamba, as xLSTM is a recurrent architecture
Can you explain this statement a bit more if you have time? Are you saying the recurrent architecture of xLSTM enables inference as fast as Mamba's? Or does the xLSTM architecture slow it down, so that its inference is as slow as Mamba's?
You mostly got it right. Usually one has many scalar 'c' cells that talk to each other via memory mixing. For the sLSTM, you group them into heads, with cells talking only to cells within the same head. The reason we referred to scalar cells here is that they are the fundamental building block. Many of them can be, and usually are, combined, and vector notation is useful in that case.
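A heavily simplified numpy sketch of that head-wise mixing (gating and everything else omitted; sizes made up): restricting memory mixing to cells within a head makes the recurrent matrix block-diagonal.

```python
import numpy as np

n_heads, cells_per_head = 4, 8  # made-up sizes
d = n_heads * cells_per_head
rng = np.random.default_rng(0)

# Memory mixing = recurrent connections between cells. Restricting it to
# cells within the same head gives a block-diagonal recurrent matrix.
R = np.zeros((d, d))
for h in range(n_heads):
    s = h * cells_per_head
    R[s:s + cells_per_head, s:s + cells_per_head] = (
        rng.normal(size=(cells_per_head, cells_per_head)) / np.sqrt(cells_per_head)
    )  # cells only 'talk' within their own head

h_prev = rng.normal(size=d)
x_t = rng.normal(size=d)
# One (heavily simplified) step: each cell's update sees the previous states
# of the other cells in its head via R @ h_prev.
h_t = np.tanh(R @ h_prev + x_t)
```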
For the matrix 'C' state, there are also heads/cells in the sense that you have multiple of them, but they don't talk to each other. So yes, you can view that as a 3D tensor. And here, the matrix is the fundamental building block/concept.
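Again as a simplified numpy sketch (constant gates, normalizer state omitted; sizes made up): the matrix memory is a 3D tensor of per-head matrices, and nothing in the update couples one head's 'C' to another's.

```python
import numpy as np

n_heads, d_head = 4, 16  # made-up sizes
rng = np.random.default_rng(0)

# The matrix memory: one C per head -> a (heads, d_head, d_head) 3D tensor.
C = np.zeros((n_heads, d_head, d_head))

def mlstm_step(C, k, v, f=0.9, i=1.0):
    # Simplified mLSTM memory update per head: C <- f * C + i * v k^T
    # (gates as constants here; the normalizer state is omitted).
    # Note: no term couples one head's C to another's -- heads don't talk.
    return f * C + i * np.einsum('hd,he->hde', v, k)

k = rng.normal(size=(n_heads, d_head))
v = rng.normal(size=(n_heads, d_head))
C = mlstm_step(C, k, v)

q = rng.normal(size=(n_heads, d_head))
h = np.einsum('hde,he->hd', C, q)  # retrieval: per-head readout C q
```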
To clarify, is the sLSTM strictly necessary (to achieve better accuracy than those other architectures), or is the mLSTM good enough? The xLSTM[1:0] model in the paper seemed to do quite well.