Also, funny how they included GPT-5.0 and 5.1 but not 5.2... I'm pretty sure they ran the benchmarks for 5.0, then 5.1 came out, so they ran the benchmarks for 5.1... and then 5.2 came out and they threw their hands up in the air and said "fuck it".
- They claim to use MLA to reduce KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and kept using it in V3, R1, etc.)
- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2 and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.
- They claimed to save a lot of money by not doing a full pre-train run costing millions of dollars. Well, so did Deepseek V3.2... Deepseek hasn't done a full pretraining run since the $5.6M one for Deepseek V3 in 2024. Deepseek R1 is just a $294k post-train on top of that expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run - I don't know the exact price, but it's probably just a few hundred thousand dollars as well.
Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run for gpt-4o in 2024. That's why they all have the same information cutoff date.
I don't really see anything new or interesting in this paper that Deepseek V3.2 hasn't already sort of done (just on a bigger scale). Not exactly the same, but is there anything amazingly new here that's not in Deepseek V3.2?
These are potentially complementary approaches. Various innovations have shrunk the KV cache size or (with DSA) how much work you have to do in each attention step. This paper is about hybrid models where some layers' state needs don't grow with context size at all.
SSMs have a fixed-size state, so on their own they're never going to be able to recite a whole file of your code in a code-editing session, for example. But if much of what an LLM is doing isn't long-distance recall, you might be able to get away with giving only some layers full recall capability, with other layers manipulating the info already retrieved (plus whatever's in their own more limited memory).
I think Kimi Linear and Qwen3-Next are both doing things a little like this: most layers' attention/memory doesn't grow with context size. Another approach, used in Google's small open Gemma models, is to give some layers only 'local' attention (most recent N tokens) and give a few layers 'full' (whole-context-window) attention. I guess we're seeing how those approaches play out and how different tricks can be cobbled together.
There can potentially be a moneyball aspect to good model architecture. Even if using space-saving attention mechanisms in some layers of big models costs something in performance on its own, their efficiency could let you 'spend' more elsewhere (more layers, more params, or such) to end up with overall better performance at a given level of resources. Seems like it's good to have experiments with many different approaches going on.
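To make the hybrid-layer idea concrete, here's a toy back-of-the-envelope sketch (all layer counts, dims, and the 1-in-6 full-attention schedule are made up for illustration, not taken from this paper or any specific model) of how per-sequence cache memory scales when most layers keep a fixed-size state and only a few keep a full KV cache:

```python
# Toy sketch (made-up numbers, not any real model's config): decode-time cache
# memory for one sequence when most layers use a fixed-size state (SSM /
# linear-attention style) and only every Nth layer keeps a full KV cache.

N_LAYERS = 24
FULL_ATTN_EVERY = 6            # hypothetical: 1 full-attention layer per 6
D_MODEL, N_HEADS, D_HEAD = 2048, 16, 128
STATE_SLOTS = 64               # fixed "memory slots" for the non-attention layers
DTYPE_BYTES = 2                # fp16/bf16

def layer_cache_bytes(context_len, full_attention):
    if full_attention:
        # Standard KV cache: grows linearly with context length.
        return 2 * context_len * N_HEADS * D_HEAD * DTYPE_BYTES
    # Fixed-size state: independent of context length.
    return STATE_SLOTS * D_MODEL * DTYPE_BYTES

def total_cache_gb(context_len, all_full=False):
    total = 0
    for layer in range(N_LAYERS):
        full = all_full or (layer % FULL_ATTN_EVERY == FULL_ATTN_EVERY - 1)
        total += layer_cache_bytes(context_len, full)
    return total / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"context {ctx:>7}: hybrid ~{total_cache_gb(ctx):.2f} GB "
          f"vs all-full-attention ~{total_cache_gb(ctx, all_full=True):.2f} GB")
```

The point being: only the handful of full-attention layers pay the context-length tax, and the memory saved is the budget the moneyball framing is about spending elsewhere.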
That's still behind the times. Even the ancient dinosaur IBM had released a Mamba model [1] before this paper was even put out.
> Granite-4.0-Tiny-Base-Preview is a 7B-parameter hybrid mixture-of-experts (MoE) language model featuring a 128k token context window. The architecture leverages Mamba-2, superimposed with a softmax attention for enhanced expressiveness, with no positional encoding for better length generalization. Release Date: May 2nd, 2025
I mean, good for them for shipping, I guess. But seriously, I'd expect any postgrad student to be able to train a similar model with some rented GPUs. They literally teach MLA to undergrads in the basic LLM class at Stanford [2], so this isn't exactly some obscure, never-heard-of concept.
Don't forget the billion dollars or so of GPUs they had access to that they left out of that accounting. Also, the R&D cost of the Meta model they originally used. Then they added $5.6 million on top of that.
Here's what's important about this paper: it is written by AMD researchers, and it shows AMD is investing in AI research. Is this the same level of achievement as DeepSeek 3.2? Most likely not. Do they have novel ideas? Difficult to say; there are hundreds of new ideas being tried in this space. Is this worthless? Most certainly not. In order to make progress in this domain (as in any other), you first need to get your feet wet. You need to play with the various components and see how they fit together. The idea in this paper is that you can somehow combine SSMs (like Mamba) and Transformer LLMs (like Llama). The examples they give are absolute toys compared to DeepSeek 3.2 (the largest is 8 billion parameters, while DeepSeek 3.2 has 671 billion parameters). The comparison you are trying to make simply does not apply. The good news for all of us is that AMD is working in this space.
Mamba based LLMs aren't even close to novel though. IBM's been doing this since forever [1].
Also, you're off on Deepseek V3.2's param count: the full model is 685B with the MTP layer included.
I don't think there's anything interesting here other than "I guess AMD put out a research paper", and it's not cutting edge when Deepseek or even IBM is running laps around them.
"In the last 2 years we've put so much resourcing into, into reasoning and one byproduct of that is you lose a little bit of muscle on pre training and post training."
"In the last six months, @merettm and I have done a lot of work to build that muscle back up."
"With all the focus on RL, there's an alpha for us because we think there's so much room left in pre training."
"As a result of these efforts, we've been training much stronger models. And that also gives us a lot of confidence carrying into Gemini 3 and other releases coming this end of the year."
But it's pretty clear that the last full pretrain run they've released is for gpt-4o 2 years ago*, and since then they've just been iterating RL for their models. You don't need any insider information to notice that, it's pretty obvious.
*Excluding GPT-4.5 of course, but even OpenAI probably wants us to forget about that.
The catch that you're missing is that Deepseek did this ages ago.
They're just using MLA, which is well known to reduce KV size by 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.
Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.
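(For anyone who hasn't looked at MLA: the core trick is caching one small latent vector per token instead of full per-head K/V, then re-expanding it at attention time. Here's a rough numpy sketch of the caching side, with made-up dimensions rather than DeepSeek's real config, and ignoring details like the decoupled RoPE keys.)

```python
import numpy as np

# Rough sketch of the MLA caching idea (made-up dimensions, not DeepSeek's real
# config): instead of caching full per-head K/V, cache one small latent vector
# per token and re-expand it into K/V at attention time.

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02    # compress to latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

def cache_token(h):
    """Cache only the compressed latent for this token's hidden state h."""
    return h @ W_down                       # shape: (d_latent,)

def expand_kv(latent_cache):
    """Reconstruct full K and V from the cached latents when attending."""
    k = latent_cache @ W_up_k               # (seq, n_heads * d_head)
    v = latent_cache @ W_up_v
    return k, v

seq = rng.standard_normal((4096, d_model))
latents = np.stack([cache_token(h) for h in seq])
k, v = expand_kv(latents)

full_kv = 2 * n_heads * d_head              # floats cached per token normally
print(f"cached floats per token: {d_latent} vs {full_kv} "
      f"({1 - d_latent / full_kv:.0%} smaller)")
```

With these toy dims the cache shrinks by ~94%, which is roughly where the "90% smaller KV cache" framing comes from.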
For a "copy Deepseek's homework" model, it's really good, preferable to DeepSeek for me (at least prior to V3.2, which I haven't been able to fully put through its paces yet). post-training really makes that much of a difference I guess
Linear attention is really bad: it's only good for benchmaxing, and it leads to a loss of valuable granularity, which you can feel in the latest DeepSeek randomly forgetting/ignoring/"correcting" explicitly stated facts in the prompt.
What's not to believe? Qwerky-32b has already done something similar as a finetune of QwQ-32b that doesn't use a traditional attention architecture.
And hybrid models aren't new; MLA-based hybrid models are basically Deepseek V3.2 in a nutshell. Note that Deepseek V3.2 (and V3.1, R1, and V3... and V2, actually) all use MLA. Deepseek V3.2 is what adds the linear attention stuff.
Actually, since Deepseek V3.1 and Deepseek V3.2 are just post-training on top of the original Deepseek V3 pretrain run, I'd say this paper is basically doing exactly what Deepseek V3.2 did in terms of efficiency.
DeepSeek-V3.2 is a sparse attention architecture, while Zebra-Llama is a hybrid attention/SSM architecture. The outcome might be similar in some ways (close to linear complexity) but I think they are otherwise quite different.
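A toy way to see the difference in per-decode-step work (purely illustrative numbers, not either model's actual mechanism or constants):

```python
# Toy comparison (illustrative only) of per-decode-step attention cost for one
# head under three schemes: full attention, DSA-style sparse attention, and a
# fixed-state SSM / linear-attention layer.

def full_attention_cost(context_len):
    # Vanilla attention: score against every cached token.
    return context_len

def sparse_attention_cost(context_len, top_k=2048):
    # Sparse attention: a cheap indexer picks top-k past tokens, so the
    # expensive attention only touches k of them (KV cache still kept).
    return min(context_len, top_k)

def ssm_cost(context_len, state_size=64):
    # SSM / linear-attention layer: update a fixed-size state, independent
    # of how long the context is; no per-token KV cache at all.
    return state_size

for ctx in (8_192, 131_072):
    print(ctx, full_attention_cost(ctx), sparse_attention_cost(ctx), ssm_cost(ctx))
```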
And when it comes to privacy, consumer advocate types and privacy wonks (I include myself in this group) want the heater to be on, and technology companies and advertising companies and all of their hangers-on want the heater to be off.
One group has a lot more money, power, and influence than the other.
It is the perfect and correct antidote to any slippery slope argument. If the consequences of the law turn out to be as bad as you say they will be, then we adjust the law.
Bizarrely horrible approach. A lot of damage would already be done, and, most importantly, changing the status quo is inherently much harder than doing nothing, so going back won't necessarily be straightforward.
Claiming that “slippery slope” is always a fallacy is a gross misconception and misinterpretation. It varies case by case; very often it can be a perfectly rational argument.
“Let’s restrict democracy and individual freedoms just a bit, maybe an authoritarian strongman is just what we need to get us out of this mess, we can always go back later..”
“Let’s try scanning all personal communication in a non-intrusive way, if it doesn’t solve CSAM problems we can always adjust the law”, right... as if that was ever going to happen.
Some lines need to be drawn that can never be crossed regardless of any good and well reasoned intentions.
I very heavily disagree here; we aren't doing as much of this as we should be.
Society is too complex of a system to predict what consequences a law will have. Badly written laws slip through. Loopholes are discovered after the fact. Incentives do what incentives do, and people eventually figure out how to game them to their own benefit. First order effects cause second order effects, which cause third order effects. Technology changes. We can't predict all of that in advance.
Trying to write a perfect law is like trying to write a perfect program on your first try, with no testing and verification, just reasoning about it in a notebook. If the code or law is of any complexity, it just can't be done. Programmers have figured this out and come up with ways to mitigate the problem, from unit testing and formal verification to canaries, feature flags, blue-green deployments, and slow rollouts. Lawmakers could learn those same lessons (and use very similar strategies), but that is very rarely done.
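(To gesture at what the software version of that looks like: a staged rollout is basically "expose a growing slice, watch a metric, roll back on regression". A toy sketch, with all names and thresholds invented:)

```python
import random

# Toy sketch of a staged (canary-style) rollout: expose a change to a growing
# slice of users, watch an error metric, roll back if it regresses. All names
# and thresholds here are invented for illustration.

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of users exposed
ERROR_BUDGET = 0.02                        # max tolerated error rate

def measure_error_rate():
    """Stand-in for real monitoring; returns an observed error rate."""
    return random.uniform(0.0, 0.03)

def staged_rollout():
    for fraction in ROLLOUT_STAGES:
        observed = measure_error_rate()
        print(f"exposed to {fraction:.0%} of users: error rate {observed:.3f}")
        if observed > ERROR_BUDGET:
            print("regression detected -> roll back before full exposure")
            return False
    print("rolled out to everyone")
    return True

staged_rollout()
```

The legal analogues the comment is pointing at would be things like pilot programs, sunset clauses, and scheduled reviews rather than permanent, universal rollouts.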
In the same post you are arguing for and against "slippery slope".
Either it is possible to easily change the law to make it worse ("slippery slope" is a valid objection), or changing the law is "much harder than doing nothing" ("slippery slope" is a fallacy).
>Some lines need to be drawn that can never be crossed regardless of any good and well reasoned intentions.
Too late. We already let the government cross the lines during Covid with freedom of movement and freedom of speech restrictions, and they got away with it because it was "for your protection". Now a lot of EU countries are crossing them even more also "for your protection" due to "Russian misinformation" and "far right/hate speech" scaremongering, which at this point is a label applied loosely to anyone speaking against unpopular government policies or exposing their corruption.
And the snowball effect continues. Governments are only increasing their grip on power (looking enviously at what China has achieved), not loosening it back. And worse, not only are they more authoritarian, but they're also practicing selective enforcement of said strict rules with the justification that it's OK because we're doing it to the "bad guys". I'm afraid we aren't gonna go back to the levels of freedom we had in 2014-2019; that ship has long sailed.
Nothing is more permanent in politics than a temporary solution. As a Norwegian, for example, I am still paying a temporary 25% tax on all spending that was enacted as a "temporary" measure over 100 years ago.
Control Theory does not work (in general) for politics for the simple reason that incentives are misaligned. That is to say, control theory itself obviously works, but for it to be a good solution in some political context you must additionally prove the existence of some Nash equilibrium where it is being correctly applied.
The thesis argues that dictators regularly both harm groups clearly inside the winning coalition and please groups clearly outside of it. A common, but not the only, reason is ideology.
One has to be careful when using game-theory models on messy human entities. Sometimes it works, sometimes it doesn't, and it's hard to determine just at what point the model breaks down. At least without empirical research.
(Another example is that actual negotiation outcomes rarely end up at the minimax or Nash product equilibria that game theory sequential negotiation concepts would suggest.)
> If the consequences of the law turns out to be as bad
This is the usual "the market will regulate itself" argument. It works when the imbalance arises organically, not so much when it's intentional on the side with more power and part of their larger roadmap.
The conflict of interest needs to be accounted for. Consequences for whom? Think of initiatives like any generic backdooring of encrypted communication where legislators are exempt. If legislators aren't truly dogfooding the results of that law, then there's no real "market pressure" to fix anything. There's only "deployment strategy": roll out the changes slowly enough that people have time to acclimate.
Control theory doesn't apply all that well to dynamical systems made entirely of human beings. You need psychohistory for that.
So you do think of "useCase.regulation" as being a single dial. It's a pretty reductive framework. I have an easier framework where, in 90% of cases, current law was already good enough and we don't need to tweak that dial.
Inference is usually less GPU-compute heavy, but much more GPU-VRAM heavy, pound for pound, compared to training. A general rule of thumb is that you need ~20x more VRAM to train a model with X params than to run inference on that same size model. So assuming batch size b, serving more than 20*b users would tilt total VRAM use toward inference.
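For what it's worth, here's the usual back-of-the-envelope accounting a multiplier like that comes from (rough mixed-precision Adam numbers; whether you land nearer 10x or 40x depends mostly on serving precision, activation memory, batch size, and sharding):

```python
# Back-of-the-envelope VRAM accounting behind a "training needs ~20x" style
# rule of thumb. Per-parameter costs are the usual rough figures for mixed-
# precision Adam training; real numbers vary a lot in practice.

def serving_gb(params_b, dtype_bytes):
    # Weights only; ignores KV cache and activation memory.
    return params_b * dtype_bytes

def training_gb(params_b, bytes_per_param=20):
    # Typical accounting: fp16 weights (2) + fp16 grads (2) + fp32 master
    # weights (4) + Adam moments (8) = 16 bytes/param, plus some allowance
    # for activations -> call it ~20 bytes/param.
    return params_b * bytes_per_param

for params_b in (7, 70):
    trn = training_gb(params_b)
    fp16 = serving_gb(params_b, 2)
    int4 = serving_gb(params_b, 0.5)
    print(f"{params_b}B params: train ~{trn:.0f} GB; serve ~{fp16:.0f} GB (fp16) "
          f"or ~{int4:.1f} GB (int4) -> {trn/fp16:.0f}x / {trn/int4:.0f}x")
```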
This isn't really accurate; it's an extremely rough rule of thumb and ignores a lot of stuff. But it's important to point out that inference is quickly becoming a big share of costs for all AI companies. Deepseek claims it spent $5.6M to train Deepseek V3 (the base for R1); that's about 10-20 trillion tokens at their current API pricing, or 1 million users sending just 100 requests each at full context size.
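That arithmetic is easy to sanity-check (the per-token prices below are an assumption, roughly in line with Deepseek's published per-million-token rates; the point is the order of magnitude, not the exact figure):

```python
# Sanity-checking the comparison above. Prices are assumptions, not quotes.

TRAIN_COST_USD = 5.6e6             # claimed V3 training cost
PRICE_PER_M_TOKENS = (0.28, 0.42)  # assumed $/1M tokens (roughly input/output)

for price in PRICE_PER_M_TOKENS:
    tokens = TRAIN_COST_USD / price * 1e6
    print(f"at ${price}/M tokens: {tokens / 1e12:.1f} trillion tokens")

# Or framed as users: 1M users x 100 requests each at a ~128K-token context.
users, requests, context = 1_000_000, 100, 128_000
print(f"1M users x 100 full-context requests ~= "
      f"{users * requests * context / 1e12:.1f} trillion tokens")
```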
Well, the counterargument is that, in theory, you can imagine a way to create structural color regardless of substrate. So imagine a technology that shines a laser on a car or a block of concrete and makes it blue; I'd argue that's correctly described as "without chemicals".
Of course, I doubt you can do that to any random substrate, since the color will depend on the properties of the material.
Optimizing a language for LLM consumption and generation (probably) doesn't mean you want a LLM designing it.