I think there's no way to tell yet; we'll only know with more research and time. One nuanced point that might not be clear: the transformer architecture was a huge part of what made traditional LLMs scale.
With the diffusion transformer and newer architectures, it may be that transformers can now be applied to diffusion as well. Diffusion also has the benefit of being able to "think" by taking more denoising steps, instead of having to output tokens and then reason over them.
I think it's hard to tell exactly where we're headed, but it's an interesting research direction, especially now that it's been somewhat validated by Google.
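To make the "thinking with more steps" point concrete, here's a toy sketch (not a real diffusion LM, and the "denoiser" just reveals a fixed target instead of predicting from context): the whole sequence starts masked and gets refined in place over K passes, so any position can be revised, unlike left-to-right decoding.

```python
import random

def toy_denoise_step(seq, target, rng):
    # Stand-in for one learned denoising pass: fill one still-masked
    # position. A real model would predict tokens from the full context.
    masked = [i for i, t in enumerate(seq) if t == "_"]
    if not masked:
        return seq
    i = rng.choice(masked)
    seq = list(seq)
    seq[i] = target[i]
    return seq

def toy_diffusion_decode(target, steps, seed=0):
    # Start from an all-masked sequence and refine it over `steps`
    # passes; more steps means a more complete (more "thought about")
    # output, which is the knob AR decoding doesn't have.
    rng = random.Random(seed)
    seq = ["_"] * len(target)
    for _ in range(steps):
        seq = toy_denoise_step(seq, target, rng)
    return seq
```

With `steps` equal to the sequence length every position gets filled; with fewer steps the output is still partially masked, which is the (very loose) analogue of cutting the thinking budget short.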
AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.
Assuming your goal is mimicking the training data, you need some mechanism for drawing from the same distribution. AR happens to provide that -- it's a particular factorization of conditional probabilities which yields exactly the distribution you started with, and it's one you can actually learn from your training data.
AR is not the only possible solution, but many other techniques floating around do not have that property of actually learning the right thing. Moreover, since the proposed limitation (not being able to think for a long time about your response before continuing) is a byproduct of current architectures rather than a fundamental flaw with AR, it's not as obvious as it might seem that you'd want to axe the technique.
A claim I believe (or want to believe), but can you point to any papers about this? I haven't seen any papers or demos showing a revision step in text diffusion. I'd really like to use one, though.
This is because diffusion can revise earlier tokens, so it doesn't suffer from early-token bias.