Hacker News

Textbook example of how to respond to your customers, kudos.



Is it?

I’m of the opinion that there’s more to it. Hiding the thinking tokens can’t plausibly be about latency: transmitting them is only a bandwidth cost, and bandwidth is hardly the bottleneck — the tokens still have to be generated either way.
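A rough back-of-envelope supports this. All the numbers below are illustrative assumptions (trace size, bytes per token, connection speed, decode rate), not measured figures, but the orders of magnitude make the point: generating the thinking tokens dwarfs transmitting them.

```python
# Back-of-envelope: does *transmitting* thinking tokens add meaningful latency?
# Every constant here is an illustrative assumption, not a measured figure.

THINKING_TOKENS = 2_000          # assumed size of a reasoning trace
BYTES_PER_TOKEN = 4              # rough average for English text
BANDWIDTH_BPS = 10_000_000       # assumed ~10 MB/s client connection
DECODE_TOKENS_PER_SEC = 50       # assumed model generation speed

transfer_s = THINKING_TOKENS * BYTES_PER_TOKEN / BANDWIDTH_BPS
generate_s = THINKING_TOKENS / DECODE_TOKENS_PER_SEC

print(f"transfer: {transfer_s * 1000:.2f} ms")  # well under a millisecond
print(f"generate: {generate_s:.0f} s")          # tens of seconds
```

Under these assumptions the wire transfer is sub-millisecond while generation takes tens of seconds, so hiding the tokens from the response changes essentially nothing about end-to-end latency.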

Seems more and more that Anthropic et al don’t want to give up their secret sauce / internals (which is their full right), and this is a step in that direction being presented as “reduces latency”.


I've understood that in more recent models you need to run extra compute to produce a human-readable version of the thinking tokens, so it does impact latency. Though probably the more important motive is that skipping this lets you squeeze in more concurrent users.

No, that’s simply whether CoT is enabled or not. That actually does have an impact.

What Anthropic is doing is still generating the thinking tokens (because they improve answer quality) without showing them to the user. I believe this may actually hint at a future where these LLM vendors don’t want to expose the internal reasoning the way they do right now.
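What “generate but don’t show” could look like on the vendor side can be sketched with mock content blocks. The block shapes and the `strip_thinking` helper below are hypothetical illustrations of the general idea, not Anthropic’s actual wire format or internals:

```python
# Illustrative sketch: a response built from typed content blocks, where
# "thinking" blocks are generated (they steer the answer) but stripped
# before delivery. Mock structure only — not Anthropic's real API.

from typing import TypedDict


class Block(TypedDict):
    type: str      # "thinking" or "text"
    content: str


def strip_thinking(blocks: list[Block]) -> list[Block]:
    """Drop reasoning blocks so only the final answer reaches the user."""
    return [b for b in blocks if b["type"] != "thinking"]


response: list[Block] = [
    {"type": "thinking", "content": "User asked for X; weigh Y against Z..."},
    {"type": "text", "content": "Here is the answer."},
]

visible = strip_thinking(response)
print(visible)  # only the "text" block remains
```

The point of the sketch: the full cost of the reasoning trace is paid at generation time regardless; the only thing the filter changes is what the user gets to see.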

I’m very much of the opinion that hiding them from the response because it “improves latency” is nonsense.



