This is the biggest news of the announcement. Prior Opus models were strong, but the cost was a big limiter of usage. This price point still makes it a "premium" option, but isn't prohibitive.
Also, it's becoming increasingly important to look at token usage rather than just per-token cost. They say Opus 4.5 (with high reasoning) used 50% fewer tokens than Sonnet 4.5. So you get a higher score on SWE-bench Verified, you pay more per token, but you use fewer tokens and overall pay less!
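Roughly, the math looks like this. The per-million-token prices and token counts below are made-up placeholders just to show the shape of the tradeoff, not Anthropic's actual numbers:

    # Hypothetical prices and token counts, only to illustrate
    # "pricier per token, but cheaper per task".
    def task_cost(price_per_mtok, tokens):
        return price_per_mtok * tokens / 1_000_000

    cheap_model  = task_cost(price_per_mtok=15, tokens=200_000)  # $3.00 for the task
    pricey_model = task_cost(price_per_mtok=25, tokens=100_000)  # $2.50, with 50% fewer tokens
    print(cheap_model, pricey_model)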
I run LM Studio for ease of use on several Mac Studios, fronted by a small token-aware router that estimates resource usage on each machine.
There's lots of optimization left there, but the systems are pinned most of the time, so I'm not focused on that at the moment; the GPUs are the bottleneck, not the queuing.
I would like to hear more about your setup if you're willing. Is the token-aware router you're using publicly available, or something you've written yourself?
It isn't open... but drop me an email and I can send it to you. Basically it just tracks a list of known LM Studio instances on the network, queries their models every 15 seconds, and routes to the ones that have the requested model loaded, using a FIFO queue that tracks tokens per model (my servers are uniform M4 Max 128GB Mac Studios, but it could also track per server), sending each request to the one that has just finished. I used to have it queue the next request just as one was expected to finish, but I was facing timeout issues due to an edge case.
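For anyone curious what that looks like in practice, here is a minimal sketch of the same idea in Python. The host names, port, and "least outstanding tokens" heuristic are my own assumptions, not the actual router; LM Studio does serve an OpenAI-compatible /v1/models endpoint by default:

    # Sketch of a token-aware FIFO router over several LM Studio hosts.
    # Hosts, polling interval, and the load heuristic are illustrative assumptions.
    import threading, time, requests

    HOSTS = ["http://studio-1:1234", "http://studio-2:1234"]  # hypothetical hosts
    loaded = {}     # host -> set of model ids currently loaded
    in_flight = {}  # host -> estimated tokens queued or running
    lock = threading.Lock()

    def poll_models():
        while True:
            for host in HOSTS:
                try:
                    r = requests.get(f"{host}/v1/models", timeout=2)
                    ids = {m["id"] for m in r.json().get("data", [])}
                except requests.RequestException:
                    ids = set()
                with lock:
                    loaded[host] = ids
            time.sleep(15)  # re-query each server's loaded models every 15 seconds

    def pick_host(model, est_tokens):
        # Route to an eligible host with the least estimated work outstanding,
        # i.e. roughly "the one that has just finished".
        with lock:
            eligible = [h for h in HOSTS if model in loaded.get(h, set())]
            if not eligible:
                raise RuntimeError(f"no host has {model} loaded")
            host = min(eligible, key=lambda h: in_flight.get(h, 0))
            in_flight[host] = in_flight.get(host, 0) + est_tokens
            return host

    def mark_done(host, est_tokens):
        with lock:
            in_flight[host] = max(0, in_flight.get(host, 0) - est_tokens)

    threading.Thread(target=poll_models, daemon=True).start()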
This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4. Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it.
Categorically different? Sure. A valid excuse to ban certain forms of linear algebra? No.
And before someone says it's reductive to call it just numbers: you could have made the same argument in favor of cryptographic export controls, that the harm cryptography does is larger than the benefit. Yet in hindsight the benefit was clearly worth it.
This is almost certainly the issue. It's very unintuitive for users, but LLMs behave much better when you clear the context often. I run /clear every third message or so with Claude Code to avoid context rot. Anthropic describes this a bit in their best practices guide [0].
This'd be a valid analogy if all compiled / interpreted languages were like INTERCAL and, e.g., refused to compile / execute programs that were insufficiently polite, or if the runtime wouldn't print out strings that it "felt" were too silly.
It depends on which vantage point you look at it from. Imagine the person directing the company, say Bill Gates, instructing that the code should be bug-free, while Microsoft itself is very opinionated about what counts as a bug.
> I don't know what it is, but trying to coax my goddamn tooling into doing what I want is not why I got into this field.
I can understand that, but as long as the tooling is still faster than doing it manually, that's the world we live in. Slower ways to 'craft' software are a hobby, not a profession.
(I'm glad I'm in it for building stuff, not for coding - I love the productivity gains).
Computer use is the most important AI benchmark to watch if you're trying to forecast labor-market impact. You're right, there are much more effective ways for ML/AI systems to accomplish tasks on the computer. But they all have to be hand-crafted for each task. Solving the general case is more scalable.
Not the current benchmarks, no. The demos in this post are so slow. Between writing the prompt, waiting a long time and checking the work I’d just rather do it myself.
For instance: I do periodic database-level backups of a very closed-source system at work. It doesn't take much of my time, but it's annoying in its simplicity: Run this GUI Windows program, click these things, select this folder, and push the go button. The backup takes as long as it takes, and then I look for obvious signs of either completion or error on the screen sometime later.
With something like this "Computer Use" model, I can automate that process.
It doesn't matter to anyone at all whether it takes 30 seconds or 30 minutes to walk through the steps: It can be done while I'm asleep or on vacation or whatever.
I can keep tabs on it with some combination of manual and automatic review, just like I would be doing if I hired a real human to do this job on my behalf.
(Yeah, yeah. There's tons of other ways to back up and restore computer data. But this is the One, True Way that is recoverable on a blank slate in a fashion that is supported by the manufacturer. I don't get to go off-script and invent a new method here.
But a screen-reading button-clicker? Sure. I can jive with that and keep an eye on it from time to time, just as I would be doing if I hired a person to do it for me.)
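To make that concrete, a request against Anthropic's computer-use beta looks roughly like the sketch below. The model id, tool version string, beta flag, and display size are assumptions from memory of the docs, and a real agent loop also has to execute each action the model asks for (screenshot, click, type) and send the result back:

    # Rough sketch against Anthropic's computer-use beta. The tool/version
    # strings and model id are assumptions; check the current docs before use.
    import anthropic

    client = anthropic.Anthropic()

    response = client.beta.messages.create(
        model="claude-sonnet-4-5",          # assumed model alias
        max_tokens=1024,
        tools=[{
            "type": "computer_20250124",    # assumed computer-use tool version
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        }],
        messages=[{
            "role": "user",
            "content": "Open the backup tool, select the backups folder, "
                       "start the backup, and report whether it shows success or an error.",
        }],
        betas=["computer-use-2025-01-24"],  # assumed beta flag
    )

    # A real agent loop would now execute each tool_use block the model returns
    # (take a screenshot, click, type), send the results back as tool_result
    # messages, and repeat until the model stops requesting actions.
    print(response.content)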
Have you tried AutoHotKey for that? It can do GUI automation. Not an LLM, but you can pre-record mouse movements and clicks, I've used it a ton to automate old windows apps
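If it helps, the same pre-recorded-clicks idea is easy to sketch from Python with pyautogui instead of AutoHotKey. Every coordinate, delay, and the "done" screenshot name below is a placeholder for whatever the real backup GUI needs:

    # Sketch of the recorded-clicks approach with pyautogui.
    # All coordinates, delays, and the screenshot name are placeholders.
    import time
    import pyautogui

    pyautogui.PAUSE = 0.5        # small pause between actions
    pyautogui.FAILSAFE = True    # slam the mouse into a screen corner to abort

    def run_backup():
        pyautogui.click(120, 740)               # taskbar shortcut for the backup tool
        time.sleep(5)                           # wait for the window to appear
        pyautogui.click(300, 200)               # "Backup" tab
        pyautogui.click(450, 320)               # destination-folder field
        pyautogui.typewrite(r"D:\backups", interval=0.05)
        pyautogui.click(500, 600)               # the "go" button
        for _ in range(240):                    # poll up to ~2 hours, every 30 s
            try:
                # locateOnScreen's confidence= needs opencv-python installed
                if pyautogui.locateOnScreen("backup_done.png", confidence=0.9):
                    return True
            except pyautogui.ImageNotFoundException:
                pass
            time.sleep(30)
        return False

    if __name__ == "__main__":
        run_backup()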
I've tried AutoHotKey previously, and I've also given up on it. I may try it again at some point.
It is worth noting that I am terrible at writing anything resembling "code" on my own. I can generally read it and follow it and understand how it does what it does, why it does that thing, and often spot when it does something that is either very stupid or very clever (or sometimes both), but producing it on a blank canvas has always been something of a quagmire from which I have been unable to escape once I tread into it.
But I can think through abstract processes of various complexities in tiny little steps, and I can also describe those steps very well in English.
Thus, it is without any sense of regret or shame that I say that the LLM era has been a boon for me in terms of the things I've been able to accomplish with a computer...and that it is primarily the natural-language instructional input of this LLM "Computer Use" model that I find rather enticing.
(I'd connect the dots and use the fluencies I do have to get the bot to write a functional AHK script, but that sounds like more work than the reward of solving this periodic annoyance is worth.)
I'm really interested in the progress on computer use. These are the benchmarks to watch if you want to forecast economic disruption, IMO. Mastery of computer use takes us out of the paradigm of task-specific integrations with AI to a more generic interface that's way more scalable.
Maybe this is true? But it's not clear to me this methodology will ever be quite as good as native tool calling. Or maybe I don't know the benchmark well enough; I just assume it's vision-based.
Perhaps Tesla FSD is a similar example: in principle, self-driving with vision alone should be possible (humans manage it), but it's fundamentally harder and more error-prone than having better data. It seems very error-prone and expensive in tokens to use computer screens as the fundamental unit.
At the same time, I'm sure there are many tasks that could be automated this way, so shrug.
Do you think a Genie-like model specifically trained on data consisting of interactions with application interfaces would be good at computer-use tasks?
Thus far I haven't had to worry about ChatGPT having bad incentives when giving me advice on product purchases. Now that "Merchants pay a small fee on completed purchases", will the model steer me towards ACP-supported retailers at a higher rate?
Presumably yes, but if I’m using an agent to make purchases, I’d prefer it to use sites where it can safely make a purchase anyway. The optimal UI would probably leave it up to me: one system should find products independently of whether their vendor supports “ACP,” and I should be able to configure my agent to “only purchase where ACP is supported.” In the context of a conversation, I’d expect it to show me all the products it found, and offer to refine the list to include only ACP vendors.