Disclaimer: I'm shared first author of this paper.
As a clarification: Training speed will be on par with FlashAttention-2 once fully optimized, when including only the mLSTM. For decoding/inference, both are very close to Mamba, as xLSTM is a recurrent architecture. The sLSTM has memory mixing, that is, state-tracking capabilities, for problems that Transformers and State Space Models (and any other sequence-parallelizable architecture) fundamentally cannot solve.
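To make the decoding point concrete, here is a toy numpy sketch (illustrative only, nothing like our actual kernels; all sizes are made up): a recurrent cell carries a fixed-size state, so per-token cost and memory stay constant, while attention has to keep a KV cache that grows with every generated token.

```python
import numpy as np

d = 8  # toy model width (made up)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d)) / np.sqrt(d)

# Recurrent decoding (xLSTM-style): fixed-size state, O(1) memory per token.
def recurrent_step(state, x):
    return np.tanh(W @ state + x)  # the state never grows

# Attention decoding: the KV cache grows with every token, O(t) memory and cost.
K_cache, V_cache = [], []
def attention_step(q, k, v):
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    a = np.exp(K @ q / np.sqrt(d))
    a /= a.sum()
    return a @ V

state = np.zeros(d)
for t in range(100):
    x = rng.normal(size=d)
    state = recurrent_step(state, x)  # memory: d floats, always
    _ = attention_step(x, x, x)       # memory: 2 * (t + 1) * d floats and counting
```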
Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer arch[1][2]; will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it's hard to compete on general-purpose hardware?
So does anything do proper state tracking? And don't point to the OP, since very often purportedly better new architectures end up being basically vaporware (like Mamba or RWKV, which still don't have good-quality pre-trained models).
Surely whether a big model using a certain architecture exists is just a matter of the choices of those with sufficient resources to train one. That reflects their beliefs, not actual model performance.
Congratulations on the paper. That's some very interesting work!
But you would want to include the sLSTM as well to get the best performance, right? How does the speed compare in that case, specifically when scaling up?
Thank you! I can say that it is not really a limiting factor at the scales reported in the paper. So xLSTM[7:1] is pretty much on par with xLSTM[1:0] in speed.
We show that it is helpful on toy tasks, and it also gives better sequence-extrapolation performance, so yes.
Great work! I'd love to start using the language-model variant of your work. Do you know when/if it will be open-sourced? I'd start using it today if it were available.
> For decoding/inference, both are very close to Mamba, as xLSTM is a recurrent architecture
Can you explain this statement a bit more if you have time? Are you saying the recurrent architecture of xLSTM enables inference as fast as Mamba's? Or does the xLSTM architecture slow it down, so that its inference is as slow as Mamba's?
You mostly got it right. Usually one has many scalar 'c' cells that talk to each other via memory mixing. For the sLSTM, you group them into heads, with cells talking only to cells within the same head. The reason we referred to scalar cells here is that they are the fundamental building block. Many of them can be, and usually are, combined, and vector notation is useful in that case.
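A heavily simplified numpy sketch of that head-wise mixing (gating and everything else omitted; sizes made up): restricting memory mixing to cells within a head makes the recurrent matrix block-diagonal.

```python
import numpy as np

n_heads, cells_per_head = 4, 8  # made-up sizes
d = n_heads * cells_per_head
rng = np.random.default_rng(0)

# Memory mixing = recurrent connections between cells. Restricting it to
# cells within the same head gives a block-diagonal recurrent matrix.
R = np.zeros((d, d))
for h in range(n_heads):
    s = h * cells_per_head
    R[s:s + cells_per_head, s:s + cells_per_head] = (
        rng.normal(size=(cells_per_head, cells_per_head)) / np.sqrt(cells_per_head)
    )  # cells only 'talk' within their own head

h_prev = rng.normal(size=d)
x_t = rng.normal(size=d)
# One (heavily simplified) step: each cell's update sees the previous states
# of the other cells in its head via R @ h_prev.
h_t = np.tanh(R @ h_prev + x_t)
```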
For the matrix 'C' state, there are also heads/cells in the sense that you have multiple of them, but they don't talk to each other. So yes, you can view that as a 3D tensor. And here, the matrix is the fundamental building block/concept.
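Again as a simplified numpy sketch (constant gates, normalizer state omitted; sizes made up): the matrix memory is a 3D tensor of per-head matrices, and nothing in the update couples one head's 'C' to another's.

```python
import numpy as np

n_heads, d_head = 4, 16  # made-up sizes
rng = np.random.default_rng(0)

# The matrix memory: one C per head -> a (heads, d_head, d_head) 3D tensor.
C = np.zeros((n_heads, d_head, d_head))

def mlstm_step(C, k, v, f=0.9, i=1.0):
    # Simplified mLSTM memory update per head: C <- f * C + i * v k^T
    # (gates as constants here; the normalizer state is omitted).
    # Note: no term couples one head's C to another's -- heads don't talk.
    return f * C + i * np.einsum('hd,he->hde', v, k)

k = rng.normal(size=(n_heads, d_head))
v = rng.normal(size=(n_heads, d_head))
C = mlstm_step(C, k, v)

q = rng.normal(size=(n_heads, d_head))
h = np.einsum('hde,he->hd', C, q)  # retrieval: per-head readout C q
```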
To clarify, is the sLSTM strictly necessary (to achieve better accuracy than those other architectures), or is the mLSTM good enough? The xLSTM[1:0] model in the paper seemed to do quite well.