More

pyentropy · 2026-06-07T21:09:02 1780866542

LLM-judge/parallel branching ≠ multi-token prediction ≠ reasoning effort.

See https://developers.openai.com/cookbook/articles/openai-harmo... and src/openai/types/shared/reasoning_effort.py

bjourne · 2026-06-08T19:18:38 1780946318

[flagged]

simianwords · 2026-06-08T20:45:21 1780951521

No it doesn't and lets not call people names. You can verify this using ChatGPT or anything else. You are mistaken and there are no "branches" happening.

bjourne · 2026-06-09T08:16:11 1780992971

And u can verify with actual facts that u are embarassingly wrong.

pyentropy · 2026-06-07T16:38:10 1780850290

The number of tokens you predict at time (multi or not) has nothing to do with whether the model wants to emit any, some or a lot of reasoning tokens in reasoning tag -- similar to how branch prediction will not really change the for loop iteration count.

sometimelurker · 2026-06-08T01:16:56 1780881416

no it might. a high reasoning task is probably harder than a low reasoning task, so the same MTP LLM will predict more correct tokens on the low reasoning task. to compensate for this, big labs likely have different MTP LLMs for different cases. it would make sense for them to do this

pyentropy · 2026-06-07T15:48:15 1780847295

Take a look at the harmony repo which specifies the internal OpenAI format - the effort level is specified in the context after the <|start|> tag - https://github.com/openai/harmony

Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>). For that, take a look at vllm reasoning docs.

pyentropy · 2026-06-07T21:09:55 1780866595

Examples with inference of different reasoning effort levels is in the OpenAI docs as well - https://developers.openai.com/cookbook/articles/openai-harmo...

https://docs.vllm.ai/en/latest/features/reasoning_outputs/#a...

https://developers.openai.com/api/docs/guides/reasoning

simianwords · 2026-06-07T21:14:30 1780866870

I think you have the right answer but I'm struggling to understand: does changing the effort change the prompt at the start of the conversation? I wonder why come up with this way at all? Why not just add a parameter at the end or something? At least it won't break cache.

Maybe like: add a secret suffix to your chat in the conversation to think more like

   conversation....

   Hey please help
   [think more]

pyentropy · 2026-06-07T21:40:15 1780868415

I'm considering the possibility that it's good to break the prefix and cache because the LLM itself was rewarded (during post-training) with different prefixes/system prompts, each containing reasoning traces of the correct size.

I might be very very wrong though and LLMs disagree with me, insisting that cache is preserved and the system message doesn't have to change (even though it often contains effort level in context) if effort level changes across turns, and that all you have to do is tell the inference lib that parses think tags to early-close think tags that are too long.

simianwords · 2026-06-08T07:09:28 1780902568

This seems correct but again I would like to think post training could have been also done by checking only the string in the last message sent.

pyentropy · on Jan 28, 2025

If H800 is a memory-constrained model that NVIDIA built to avoid the Chinese export ban on H100 with equivalent fp8 performance, it makes zero sense to believe Elon Musk, Dario Armodei and Alexandr Wang's claims that DeepSeek smuggled H100s.

The only reason why a team would allocate time on memory optimizations and writing NVPTX code rather than focusing on posttraining is if they severely struggled with memory during training.

I mean, take a look at the numbers:

https://www.fibermall.com/blog/nvidia-ai-chip.htm#A100_vs_A8...

This is a massive trick pulled by Jensen, take the H100 design whose sales are regulated by the government, make it look 40x weaker and call it H800, while conveniently leaving 8-bit computation as fast as H100. Then bring it to China and let companies stockpile without disclosing production or sales numbers, and have no export controls.

Eventually, after 7 months, US govt starts noticing the H800 sales and introduces new export controls, but it's too late. By this point, DeepSeek has started research using fp8. They slowly build bigger and bigger models, work on the bandwidth and memory consumptions, until they make r1 - their reasoning model.

cyanydeez · on Jan 28, 2025

What's surprising is anyone would repeat Elon musk related things.

Tech or politics related, he's off the deep end.

mnky9800n · on Jan 28, 2025

Especially since he seems intent on everyone talking about him all the time. I find it questionable when a person wants to be the centre of attention no matter. Perhaps attention is not all we need.

K0balt · on Jan 28, 2025

Yet another casualty of laypersons browsing arXiv. That paper was like flypaper to his narcissism.

AnthonyMouse · on Jan 28, 2025

The problem is he's only wrong some of the time and then people arguing about which one it is this time generates attention, a valuable commodity.

m-s-y · on Jan 28, 2025

Maybe “some” applied in the past but his recent history might best be described as “almost always”.

Muromec · on Jan 28, 2025

Drugs. Dont do that much drugs for so long.

numpad0 · on Jan 28, 2025

He's like a broken smart network switch, smart as in managed. Packets with switch MAC on it are all broken, but erroneously forwarded ones often has valuable data. We through L3 don't know which one is which.

cyanydeez · on Jan 28, 2025

I'm wrong some of the times.

He's a lucky mensch, no more, no less.

schubart · on Jan 28, 2025

Interesting how people keep calling it “the Chinese export ban”. Isn’t an American export ban?

pyentropy · on July 7, 2024

You should start a blog... or maybe not - pursue the battle in academia/work and occasionally drop nuggets of wisdom like this somewhere. But do not delete them.

pyentropy · on June 13, 2024

I updated the post with a a link to counter-argument from Sabine Hossenfelder, the arguments from Zvi and three points from my side.

pyentropy · on June 13, 2024

I updated the post with a a link to counter-argument from Sabine Hossenfelder, the arguments from Zvi and three points from my side.

pyentropy · on June 13, 2024

Scott worked at OpenAI Safety and he likes it: https://scottaaronson.blog/?p=8047

But is the "-ed" in worked a problem?

pyentropy · on June 13, 2024

Thank you.

mistermann · on June 13, 2024

The irony of these conversations is bizarre.

As the (tautological) saying goes: everyone is doing their best. My interest is whether this can be improved - perhaps at some point when AI gets closer to challenging us for cognitive supremacy we will awake from our slumber.

pyentropy · on June 13, 2024

It is a question. I tried to put what my opinion is on a few statements but I absolutely cannot summarize 160 pages (Business Insider did using GPT, which I find insulting and funny) nor have a 100% opinion on something that involves national security, secrets and other stuff that I don't have access to.