Open-source model costs are determined only by electricity usage, since anyone can rent a GPU and host them.
Closed-source models cost 10x more just because they can.
A simple example is Claude Opus, which costs ~1/10 as much, if not less, through Claude Code, which doesn't have that price multiplier.
This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.
Also, vendors need to make a profit! So tack a little extra on as well.
However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.
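To put rough numbers on that, here's a back-of-the-envelope sketch. Every figure in it (rental price, aggregate throughput, Mac power draw, electricity rate, local decode speed) is an illustrative assumption, not a measurement:

    // Rough $/Mtok calculator: hourly cost divided by tokens produced per hour.
    // Every number below is an illustrative assumption, not a measurement.
    function usdPerMillionTokens(usdPerHour: number, tokensPerSecond: number): number {
      return (usdPerHour / (tokensPerSecond * 3600)) * 1_000_000;
    }

    // Rented 8xH100 node: assume ~$16/hr and ~2,000 tok/s aggregate decode
    // across batched requests (the 100+ tps figure above is per request).
    console.log("cloud, rental amortized:", usdPerMillionTokens(16, 2000).toFixed(2));

    // Mac you already own: marginal cost is electricity only. Assume ~150 W
    // under load at $0.30/kWh (= $0.045/hr) and ~10 tok/s local decode.
    console.log("mac, electricity only:  ", usdPerMillionTokens(0.045, 10).toFixed(2));

The point isn't the exact figures, just that cloud pricing has to amortize the hardware (plus margin) while a machine you already own only adds its electricity bill.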
A question on the 100+ tps: is this for short prompts? For large contexts (120k+ tokens) that generate a sizable chunk of output, I was seeing 30-50 tps, and that's with a 95% KV cache hit rate. I'm wondering if I'm simply doing something wrong here...
Assuming you're using speculative decoding, it depends on how well the speculator predicts your outputs: unusual text decodes more slowly, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults, IME.
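Concretely, the knobs in question are SGLang's --chunked-prefill-size and --cuda-graph-max-bs server arguments (flag names as I recall them from sglang.launch_server; check --help for your version). A minimal launch sketch driven from Bun.spawn; the model id and values are assumptions to tune for your own hardware, and it assumes sglang is installed in the local Python environment:

    // Launch an SGLang server with larger prefill chunks and bigger CUDA-graph
    // batch sizes than the defaults. Values here are illustrative, not tuned.
    const server = Bun.spawn(
      [
        "python", "-m", "sglang.launch_server",
        "--model-path", "zai-org/GLM-4.6",   // assumed checkpoint; use your own
        "--tp", "8",                          // tensor parallel across 8 GPUs
        "--chunked-prefill-size", "8192",     // larger prefill chunks for long prompts
        "--cuda-graph-max-bs", "256",         // capture CUDA graphs for larger batches
      ],
      { stdout: "inherit", stderr: "inherit" },
    );

    await server.exited;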
Where other models would grep, then read the results, then run a search, then read the results, then read 100 lines from a file, then read the results, Composer 1 is trained to grep AND search AND read in one round trip.
It may read 15 files, and then make small edits in all 15 files at once.
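Not Cursor's actual implementation, obviously, but as a sketch of the difference: batching tool calls means all the results come back in one model round trip instead of one per call. The tool names and the runTool stub below are made up for illustration:

    // Illustration only: one round trip per tool call vs. one batched round trip.
    type ToolCall = { tool: "grep" | "search" | "read"; args: Record<string, string> };

    // Stand-in for real grep/search/file-read implementations.
    async function runTool(call: ToolCall): Promise<string> {
      return `${call.tool}(${JSON.stringify(call.args)}) -> ...`;
    }

    // Sequential agent: N tool calls means N model round trips, each waiting
    // on the previous result before the model decides what to do next.
    async function sequential(calls: ToolCall[]): Promise<string[]> {
      const results: string[] = [];
      for (const call of calls) {
        results.push(await runTool(call));
      }
      return results;
    }

    // Batched agent: the model emits all the calls up front and the harness
    // runs them concurrently, so every result returns in a single round trip.
    async function batched(calls: ToolCall[]): Promise<string[]> {
      return Promise.all(calls.map(runTool));
    }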
Just ask an LLM to write one on top of OpenRouter, the AI SDK, and Bun,
to take your .md input file and save the outputs as .md files (or whatever you need).
Take https://github.com/T3-Content/auto-draftify as an example.
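For reference, a minimal sketch of such a script; the model id, prompt, and file names are placeholders, and it assumes the ai and @openrouter/ai-sdk-provider packages are installed and OPENROUTER_API_KEY is set:

    // draftify.ts -- run with: bun run draftify.ts input.md output.md
    // Read a markdown file, run it through an OpenRouter model via the AI SDK,
    // and write the result back out as markdown.
    import { generateText } from "ai";
    import { createOpenRouter } from "@openrouter/ai-sdk-provider";

    const openrouter = createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY });
    const [inputPath = "input.md", outputPath = "output.md"] = process.argv.slice(2);

    const source = await Bun.file(inputPath).text();

    const { text } = await generateText({
      model: openrouter("anthropic/claude-3.5-sonnet"), // any OpenRouter model id
      prompt: `Turn the following notes into a polished draft, in markdown:\n\n${source}`,
    });

    await Bun.write(outputPath, text);
    console.log(`wrote ${outputPath}`);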