> Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on Deep...

KronisLV · 2025-10-12T13:06:20 1760274380

> I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.

I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code

At the same time, their Qwen Coder 480B model is fine but I still find myself going for Claude or GPT-5 or Gemini 2.5 Pro for more complex issues (or ones where I need good usage of Latvian language), at least for programming tasks it'd eventually be super cool if they could offer more models.

Or have some sort of a partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limits occasionally.

meander_water · 2025-10-12T11:50:43 1760269843

Interesting, if you take a look at the median throughput chart [0], groq goes insane after 7th Oct. Wonder what happened.

[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance

sigmar · 2025-10-12T14:07:42 1760278062

2x jump overnight. new LPU hardware? I checked the speed for groq's gpt-oss-120B, Llama4-maverick, and Llama4-scout; none of them had a noticeable change this month

awestroke · 2025-10-12T11:54:52 1760270092

Heavy quantization

petesergeant · 2025-10-12T15:35:38 1760283338

They claim (or someone on Reddit who claims to be staff claims) that's not accurate: https://www.reddit.com/r/LocalLLaMA/comments/1mk4kt0/comment...

immortal3 · 2025-10-12T11:48:34 1760269714

There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure about Together. In this case, Together is sharing results based on the B200 GPU. Another important point is the accuracy of these speed-ups compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers. https://x.com/Kimi_Moonshot/status/1976926483319763130

rfoo · 2025-10-12T12:05:05 1760270705

> It's known that such tricks reduce accuracy

AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.

martinald · 2025-10-12T12:47:21 1760273241

No it shouldn't do. "All" you're doing is having a small model run the prompt and then have the large model "verify" it. When the large model diverges from the small one, you restart the process again.

Der_Einzige · 2025-10-12T14:16:01 1760278561

It’s quantization which is crippling accuracy…

petesergeant · 2025-10-13T07:17:52 1760339872

People all over this subthread saying that with no evidence provided. The company say they don’t — which would be pretty embarrassing to have to walk back — so who’s saying they do?

jsheard · 2025-10-12T11:56:18 1760270178

> Groq and Cerebras use custom chips

Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.

petesergeant · 2025-10-12T15:37:06 1760283426

This is an "expensive for whom" question. I'd be keen to know if they're burning investor money hosting these right now or if they're able to run these at cost.

senko · 2025-10-12T11:23:49 1760268229

> You'll see Groq averaging 1,086tps

What I don't understand is, Groq reporting 200tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...

OpenRouter numbers look fishy.

petesergeant · 2025-10-13T06:19:18 1760336358

Wonder if it’s prompt caching? OpenRouter is (I guess) just reporting actual throughput, where presumably groq is reporting a from-scratch figure? Just a guess tho.

jbellis · 2025-10-12T11:43:31 1760269411

groq is quantizing, even though it's not labeled as such on openrouter (super frustrating)

bn-l · 2025-10-12T12:41:30 1760272890

Do you have a source for that? They are pretty close to the ref implementation on moonshot’s ranking

jbellis · 2025-10-13T12:40:08 1760359208

https://groq.com/blog/inside-the-lpu-deconstructing-groq-spe...

alecco · 2025-10-12T15:51:13 1760284273

But Groq/Cerebras are hardware accelerators. It's an unrelated optimization. I wouldn't be surprised if they could also use speculators (today or in the future).

Havoc · 2025-10-12T11:36:51 1760269011

>Groq and Cerebras often feel like the only games in town.

SambaNova should be similar...they've got a similar specialized hardware approach

p1esk · 2025-10-12T11:28:38 1760268518

Do these numbers compare performance at the same cost?

petesergeant · 2025-10-12T13:32:59 1760275979

You can see the cost in the links, and the answer is “pretty much” for the consumer. The backend maths, no idea.