Also, funny how they included GPT-5.0 and 5.1 but not 5.2... I'm pretty sure they ran the benchmarks for 5.0, then 5.1 came out, so they ran the benchmarks for 5.1... and then 5.2 came out and they threw their hands up in the air and said "fuck it".
- They claim to use MLA to reduce KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and kept using it in V3, R1, etc.)
- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2 and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.
- They claimed to save a lot of money by not doing a full pre-train run costing millions of dollars. Well, so did Deepseek V3.2... Deepseek hasn't done a full pretraining run since the $5.6M one for Deepseek V3 in 2024. Deepseek R1 is just a $294k post-train on top of that expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run - I don't know the exact price, but it's probably just a few hundred thousand dollars as well.
Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run for gpt-4o in 2024. That's why they all have the same information cutoff date.
I don't really see anything new or interesting in this paper that Deepseek V3.2 hasn't already sort of done (just on a bigger scale). Not exactly the same, but is there anything amazingly new here that's not in Deepseek V3.2?
These are potentially complementary approaches. Various innovations have shrunk the KV cache size or (with DSA) how much work you have to do in each attention step. This paper is about hybrid models where some layers' state needs don't grow with context size at all.
SSMs have a fixed-size state, so on their own they're never going to be able to recite a whole file of your code in a code-editing session, for example. But if much of what an LLM is doing isn't long-distance recall, you might be able to get away with giving only some layers full recall capability, with other layers manipulating the info already retrieved (plus whatever's in their own more limited memory).
I think Kimi Linear and Qwen3-Next are both doing things a little like this: most layers' attention/memory doesn't grow with context size. Another approach, used in Google's small open Gemma models, is to give some layers only 'local' attention (most recent N tokens) and give a few layers 'full' (whole-context-window) attention. I guess we're seeing how those approaches play out and how different tricks can be cobbled together.
There can potentially be a moneyball aspect to good model architecture. Even if using space-saving attention mechanisms in some layers of big models costs something in performance on its own, their efficiency could let you 'spend' more elsewhere (more layers, more params, or such) to end up with overall better performance at a given level of resources. Seems like it's good to have experiments with many different approaches going on.
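To make the hybrid-layer idea concrete, here's a toy back-of-the-envelope sketch (all layer counts, dims, and the 1-in-6 full-attention schedule are made up for illustration, not taken from this paper or any specific model) of how per-sequence cache memory scales when most layers keep a fixed-size state and only a few keep a full KV cache:

```python
# Toy sketch (made-up numbers, not any real model's config): decode-time cache
# memory for one sequence when most layers use a fixed-size state (SSM /
# linear-attention style) and only every Nth layer keeps a full KV cache.

N_LAYERS = 24
FULL_ATTN_EVERY = 6            # hypothetical: 1 full-attention layer per 6
D_MODEL, N_HEADS, D_HEAD = 2048, 16, 128
STATE_SLOTS = 64               # fixed "memory slots" for the non-attention layers
DTYPE_BYTES = 2                # fp16/bf16

def layer_cache_bytes(context_len, full_attention):
    if full_attention:
        # Standard KV cache: grows linearly with context length.
        return 2 * context_len * N_HEADS * D_HEAD * DTYPE_BYTES
    # Fixed-size state: independent of context length.
    return STATE_SLOTS * D_MODEL * DTYPE_BYTES

def total_cache_gb(context_len, all_full=False):
    total = 0
    for layer in range(N_LAYERS):
        full = all_full or (layer % FULL_ATTN_EVERY == FULL_ATTN_EVERY - 1)
        total += layer_cache_bytes(context_len, full)
    return total / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"context {ctx:>7}: hybrid ~{total_cache_gb(ctx):.2f} GB "
          f"vs all-full-attention ~{total_cache_gb(ctx, all_full=True):.2f} GB")
```

The point being: only the handful of full-attention layers pay the context-length tax, and the memory saved is the budget the moneyball framing is about spending elsewhere.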
That's still behind the times. Even the ancient dinosaur IBM had released a Mamba model [1] before this paper was even put out.
> Granite-4.0-Tiny-Base-Preview is a 7B-parameter hybrid mixture-of-experts (MoE) language model featuring a 128k token context window. The architecture leverages Mamba-2, superimposed with a softmax attention for enhanced expressiveness, with no positional encoding for better length generalization. Release Date: May 2nd, 2025
I mean, good for them for shipping, I guess. But seriously, I'd expect any postgrad student to be able to train a similar model with some rented GPUs. They literally teach MLA to undergrads in the basic LLM class at Stanford [2], so this isn't exactly some obscure, never-heard-of concept.
Don't forget the billion dollars or so of GPUs they had access to that they left out of that accounting. Also, the R&D cost of the Meta model they originally used. Then they added $5.6 million on top of that.
Here's what's important about this paper: it is written by AMD researchers, and it shows AMD is investing in AI research. Is this the same level of achievement as DeepSeek 3.2? Most likely not. Do they have novel ideas? Difficult to say; there are hundreds of new ideas being tried in this space. Is this worthless? Most certainly not. In order to make progress in this domain (as in any other), you first need to get your feet wet. You need to play with the various components and see how they fit together. The idea in this paper is that you can somehow combine SSMs (like Mamba) and Transformer LLMs (like Llama). The examples they give are absolute toys compared to DeepSeek 3.2 (the largest is 8 billion parameters, while DeepSeek 3.2 has 671 billion parameters). The comparison you are trying to make simply does not apply. The good news for all of us is that AMD is working in this space.
Mamba based LLMs aren't even close to novel though. IBM's been doing this since forever [1].
Also, you're off on Deepseek V3.2's param count: the full model is 685B with the MTP layer included.
I don't think there's anything interesting here other than "I guess AMD put out a research paper", and it's not cutting edge when Deepseek or even IBM is running laps around them.
"In the last 2 years we've put so much resourcing into, into reasoning and one byproduct of that is you lose a little bit of muscle on pre training and post training."
"In the last six months, @merettm and I have done a lot of work to build that muscle back up."
"With all the focus on RL, there's an alpha for us because we think there's so much room left in pre training."
"As a result of these efforts, we've been training much stronger models. And that also gives us a lot of confidence carrying into Gemini 3 and other releases coming this end of the year."
But it's pretty clear that the last full pretrain run they've released is for gpt-4o 2 years ago*, and since then they've just been iterating RL for their models. You don't need any insider information to notice that, it's pretty obvious.
*Excluding GPT-4.5 of course, but even OpenAI probably wants us to forget about that.
The catch that you're missing is that Deepseek did this ages ago.
They're just using MLA, which is well known to reduce KV size by 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.
Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.
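(For anyone who hasn't looked at MLA: the core trick is caching one small latent vector per token instead of full per-head K/V, then re-expanding it at attention time. Here's a rough numpy sketch of the caching side, with made-up dimensions rather than DeepSeek's real config, and ignoring details like the decoupled RoPE keys.)

```python
import numpy as np

# Rough sketch of the MLA caching idea (made-up dimensions, not DeepSeek's real
# config): instead of caching full per-head K/V, cache one small latent vector
# per token and re-expand it into K/V at attention time.

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02    # compress to latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

def cache_token(h):
    """Cache only the compressed latent for this token's hidden state h."""
    return h @ W_down                       # shape: (d_latent,)

def expand_kv(latent_cache):
    """Reconstruct full K and V from the cached latents when attending."""
    k = latent_cache @ W_up_k               # (seq, n_heads * d_head)
    v = latent_cache @ W_up_v
    return k, v

seq = rng.standard_normal((4096, d_model))
latents = np.stack([cache_token(h) for h in seq])
k, v = expand_kv(latents)

full_kv = 2 * n_heads * d_head              # floats cached per token normally
print(f"cached floats per token: {d_latent} vs {full_kv} "
      f"({1 - d_latent / full_kv:.0%} smaller)")
```

With these toy dims the cache shrinks by ~94%, which is roughly where the "90% smaller KV cache" framing comes from.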
For a "copy Deepseek's homework" model, it's really good, preferable to DeepSeek for me (at least prior to V3.2, which I haven't been able to fully put through its paces yet). post-training really makes that much of a difference I guess
Linear attention is really bad: it's only good for benchmaxing, and it leads to a loss of valuable granularity, which you can feel in the latest DeepSeek randomly forgetting/ignoring/"correcting" explicitly stated facts in the prompt.
What's not to believe? Qwerky-32b has already done something similar as a finetune of QwQ-32b that doesn't use a traditional attention architecture.
And hybrid models aren't new; MLA-based hybrid models are basically Deepseek V3.2 in a nutshell. Note that Deepseek V3.2 (and V3.1, R1, and V3... and V2, actually) all use MLA. Deepseek V3.2 is what adds the linear attention stuff.
Actually, since Deepseek V3.1 and Deepseek V3.2 are just post-training on top of the original Deepseek V3 pretrain run, I'd say this paper is basically doing exactly what Deepseek V3.2 did in terms of efficiency.
DeepSeek-V3.2 is a sparse attention architecture, while Zebra-Llama is a hybrid attention/SSM architecture. The outcome might be similar in some ways (close to linear complexity) but I think they are otherwise quite different.
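A toy way to see the difference in per-decode-step work (purely illustrative numbers, not either model's actual mechanism or constants):

```python
# Toy comparison (illustrative only) of per-decode-step attention cost for one
# head under three schemes: full attention, DSA-style sparse attention, and a
# fixed-state SSM / linear-attention layer.

def full_attention_cost(context_len):
    # Vanilla attention: score against every cached token.
    return context_len

def sparse_attention_cost(context_len, top_k=2048):
    # Sparse attention: a cheap indexer picks top-k past tokens, so the
    # expensive attention only touches k of them (KV cache still kept).
    return min(context_len, top_k)

def ssm_cost(context_len, state_size=64):
    # SSM / linear-attention layer: update a fixed-size state, independent
    # of how long the context is; no per-token KV cache at all.
    return state_size

for ctx in (8_192, 131_072):
    print(ctx, full_attention_cost(ctx), sparse_attention_cost(ctx), ssm_cost(ctx))
```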
And when it comes to privacy, consumer advocate types and privacy wonks (I include myself in this group) want the heater to be on, and technology companies and advertising companies and all of their hangers-on want the heater to be off.
One group has a lot more money, power, and influence than the other.
It is the perfect and correct antidote to any slippery slope argument. If the consequences of the law turn out to be as bad as you say they will be, then we adjust the law.
Bizarrely horrible approach. A lot of damage would already be done, and, most importantly, changing the status quo is inherently much harder than doing nothing, so going back won't necessarily be straightforward.
Claiming that “slippery slope” is always a fallacy is a gross misconception and misinterpretation. It varies case by case; very often it can be a perfectly rational argument.
“Let’s restrict democracy and individual freedoms just a bit, maybe an authoritarian strongman is just what we need to get us out of this mess, we can always go back later..”
“Let’s try scanning all personal communication in a non-intrusive way, if it doesn’t solve CSAM problems we can always adjust the law”, right... as if that was ever going to happen.
Some lines need to be drawn that can never be crossed regardless of any good and well reasoned intentions.
I very heavily disagree here; we aren't doing as much of this as we should be.
Society is too complex of a system to predict what consequences a law will have. Badly written laws slip through. Loopholes are discovered after the fact. Incentives do what incentives do, and people eventually figure out how to game them to their own benefit. First order effects cause second order effects, which cause third order effects. Technology changes. We can't predict all of that in advance.
Trying to write a perfect law is like trying to write a perfect program on your first try, with no testing and verification, just reasoning about it in a notebook. If the code or law is of any complexity, it just can't be done. Programmers have figured this out and come up with ways to mitigate the problem, from unit testing and formal verification to canaries, feature flags, blue-green deployments, and slow rollouts. Lawmakers could learn those same lessons (and use very similar strategies), but that is very rarely done.
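(To gesture at what the software version of that looks like: a staged rollout is basically "expose a growing slice, watch a metric, roll back on regression". A toy sketch, with all names and thresholds invented:)

```python
import random

# Toy sketch of a staged (canary-style) rollout: expose a change to a growing
# slice of users, watch an error metric, roll back if it regresses. All names
# and thresholds here are invented for illustration.

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of users exposed
ERROR_BUDGET = 0.02                        # max tolerated error rate

def measure_error_rate():
    """Stand-in for real monitoring; returns an observed error rate."""
    return random.uniform(0.0, 0.03)

def staged_rollout():
    for fraction in ROLLOUT_STAGES:
        observed = measure_error_rate()
        print(f"exposed to {fraction:.0%} of users: error rate {observed:.3f}")
        if observed > ERROR_BUDGET:
            print("regression detected -> roll back before full exposure")
            return False
    print("rolled out to everyone")
    return True

staged_rollout()
```

The legal analogues the comment is pointing at would be things like pilot programs, sunset clauses, and scheduled reviews rather than permanent, universal rollouts.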
In the same post you are arguing for and against "slippery slope".
Either it is possible to easily change the law to make it worse ("slippery slope" is a valid objection), or changing the law is "much harder than doing nothing" ("slippery slope" is a fallacy).
>Some lines need to be drawn that can never be crossed regardless of any good and well reasoned intentions.
Too late. We already let the government cross the lines during Covid with freedom of movement and freedom of speech restrictions, and they got away with it because it was "for your protection". Now a lot of EU countries are crossing them even more also "for your protection" due to "Russian misinformation" and "far right/hate speech" scaremongering, which at this point is a label applied loosely to anyone speaking against unpopular government policies or exposing their corruption.
And the snowball effect continues. Governments are only increasing their grip on power (looking enviously at what China has achieved), not loosening it back. And worse, not only are they more authoritarian, but they're also practicing selective enforcement of said strict rules with the justification that it's OK because we're doing it to the "bad guys". I'm afraid we aren't gonna go back to the levels of freedom we had in 2014-2019; that ship has long sailed.
Nothing is more permanent in politics than a temporary solution. As a Norwegian, for example, I am still paying a temporary 25% tax on all spending that was enacted as a "temporary" measure over 100 years ago.
Control Theory does not work (in general) for politics for the simple reason that incentives are misaligned. That is to say, control theory itself obviously works, but for it to be a good solution in some political context you must additionally prove the existence of some Nash equilibrium where it is being correctly applied.
The thesis argues that dictators regularly both harm groups clearly inside the winning coalition and please groups clearly outside of it. A common, but not the only, reason is ideology.
One has to be careful when using game-theory models on messy human entities. Sometimes it works, sometimes it doesn't, and it's hard to determine just at what point the model breaks down. At least without empirical research.
(Another example is that actual negotiation outcomes rarely end up at the minimax or Nash product equilibria that game theory sequential negotiation concepts would suggest.)
> If the consequences of the law turns out to be as bad
This is the usual "the market will regulate itself" argument. It works when the imbalance arises organically, not so much when it's intentional on the side with more power and part of their larger roadmap.
The conflict of interest needs to be accounted for. Consequences for whom? Think of initiatives like any generic backdooring of encrypted communication where legislators are exempt. If legislators aren't truly dogfooding the results of that law, then there's no real "market pressure" to fix anything. There's only "deployment strategy": roll out the changes slowly enough that people have time to acclimate.
Control theory doesn't apply all that well to dynamical systems made entirely of human beings. You need psychohistory for that.
So you do think of "useCase.regulation" as being a single dial. It's a pretty reductive framework. I have an easier framework where, in 90% of cases, current law was already good enough and we don't need to tweak that dial.
Inference is usually less GPU-compute heavy, but much more GPU-VRAM heavy, pound for pound, compared to training. A general rule of thumb is that you need ~20x more VRAM to train a model with X params than to run inference on that same size model. So assuming batch size b, serving more than 20*b users would tilt total VRAM use toward inference.
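For what it's worth, here's the usual back-of-the-envelope accounting a multiplier like that comes from (rough mixed-precision Adam numbers; whether you land nearer 10x or 40x depends mostly on serving precision, activation memory, batch size, and sharding):

```python
# Back-of-the-envelope VRAM accounting behind a "training needs ~20x" style
# rule of thumb. Per-parameter costs are the usual rough figures for mixed-
# precision Adam training; real numbers vary a lot in practice.

def serving_gb(params_b, dtype_bytes):
    # Weights only; ignores KV cache and activation memory.
    return params_b * dtype_bytes

def training_gb(params_b, bytes_per_param=20):
    # Typical accounting: fp16 weights (2) + fp16 grads (2) + fp32 master
    # weights (4) + Adam moments (8) = 16 bytes/param, plus some allowance
    # for activations -> call it ~20 bytes/param.
    return params_b * bytes_per_param

for params_b in (7, 70):
    trn = training_gb(params_b)
    fp16 = serving_gb(params_b, 2)
    int4 = serving_gb(params_b, 0.5)
    print(f"{params_b}B params: train ~{trn:.0f} GB; serve ~{fp16:.0f} GB (fp16) "
          f"or ~{int4:.1f} GB (int4) -> {trn/fp16:.0f}x / {trn/int4:.0f}x")
```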
This isn't really accurate; it's an extremely rough rule of thumb and ignores a lot of stuff. But it's important to point out that inference is quickly becoming a big share of costs for all AI companies. Deepseek claims it spent $5.6M to train Deepseek V3 (the base for R1); that's about 10-20 trillion tokens at their current API pricing, or 1 million users sending just 100 requests each at full context size.
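That arithmetic is easy to sanity-check (the per-token prices below are an assumption, roughly in line with Deepseek's published per-million-token rates; the point is the order of magnitude, not the exact figure):

```python
# Sanity-checking the comparison above. Prices are assumptions, not quotes.

TRAIN_COST_USD = 5.6e6             # claimed V3 training cost
PRICE_PER_M_TOKENS = (0.28, 0.42)  # assumed $/1M tokens (roughly input/output)

for price in PRICE_PER_M_TOKENS:
    tokens = TRAIN_COST_USD / price * 1e6
    print(f"at ${price}/M tokens: {tokens / 1e12:.1f} trillion tokens")

# Or framed as users: 1M users x 100 requests each at a ~128K-token context.
users, requests, context = 1_000_000, 100, 128_000
print(f"1M users x 100 full-context requests ~= "
      f"{users * requests * context / 1e12:.1f} trillion tokens")
```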
Well, the counterargument is that, in theory, you can imagine a way to create structural color regardless of substrate. So imagine a technology that shines a laser on a car or a block of concrete and makes it blue; I'd argue that's correctly described as "without chemicals".
Of course, I doubt you can do that to any random substrate, since the color will depend on the properties of the material.
Optimizing a language for LLM consumption and generation (probably) doesn't mean you want a LLM designing it.