I think there's no way to tell yet; we'll only know with more research and time. One nuanced point that might not be clear: the transformer architecture was a huge part of what made traditional LLMs scale.
With the diffusion transformer and newer architectures, it may be that transformers can now be applied to diffusion as well. Diffusion also has the benefit of being able to "think" by taking more denoising steps, instead of having to output tokens and then reason over them.
I think it's hard to tell exactly where we're headed, but it's an interesting research direction, especially now that it's been somewhat validated by Google.
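To make the "thinking with more steps" point concrete, here's a toy sketch (not a real diffusion LM, and the "denoiser" just reveals a fixed target instead of predicting from context): the whole sequence starts masked and gets refined in place over K passes, so any position can be revised, unlike left-to-right decoding.

```python
import random

def toy_denoise_step(seq, target, rng):
    # Stand-in for one learned denoising pass: fill one still-masked
    # position. A real model would predict tokens from the full context.
    masked = [i for i, t in enumerate(seq) if t == "_"]
    if not masked:
        return seq
    i = rng.choice(masked)
    seq = list(seq)
    seq[i] = target[i]
    return seq

def toy_diffusion_decode(target, steps, seed=0):
    # Start from an all-masked sequence and refine it over `steps`
    # passes; more steps means a more complete (more "thought about")
    # output, which is the knob AR decoding doesn't have.
    rng = random.Random(seed)
    seq = ["_"] * len(target)
    for _ in range(steps):
        seq = toy_denoise_step(seq, target, rng)
    return seq
```

With `steps` equal to the sequence length every position gets filled; with fewer steps the output is still partially masked, which is the (very loose) analogue of cutting the thinking budget short.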
AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.
Assuming your goal is mimicking the training data, you need some mechanism for drawing from the same distribution. AR happens to provide that -- it's a particular factorization of conditional probabilities which yields exactly the distribution you started with, and it's one you can actually learn from your training data.
AR is not the only possible solution, but many other techniques floating around do not have that property of actually learning the right thing. Moreover, since the proposed limitation (not being able to think for a long time about your response before continuing) is a byproduct of current architectures rather than a fundamental flaw with AR, it's not as obvious as it might seem that you'd want to axe the technique.
A claim I believe (or want to believe), but can you point to any papers about this? I haven't seen any papers or demos showing a revision step in text diffusion. I'd really like to use one, though.
This is because diffusion can revise earlier tokens, so it doesn't suffer from early-token bias.