This is the biggest news of the announcement. Prior Opus models were strong, but the cost was a big limiter of usage. This price point still makes it a "premium" option, but isn't prohibitive.
Also, it's becoming increasingly important to look at token usage rather than just per-token cost. They say Opus 4.5 (with high reasoning) used 50% fewer tokens than Sonnet 4.5. So you get a higher score on SWE-bench Verified, you pay more per token, but you use fewer tokens and overall pay less!
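Roughly, the math looks like this. The per-million-token prices and token counts below are made-up placeholders just to show the shape of the tradeoff, not Anthropic's actual numbers:

    # Hypothetical prices and token counts, only to illustrate
    # "pricier per token, but cheaper per task".
    def task_cost(price_per_mtok, tokens):
        return price_per_mtok * tokens / 1_000_000

    cheap_model  = task_cost(price_per_mtok=15, tokens=200_000)  # $3.00 for the task
    pricey_model = task_cost(price_per_mtok=25, tokens=100_000)  # $2.50, with 50% fewer tokens
    print(cheap_model, pricey_model)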
I run LM Studio for ease of use on several Mac Studios, fronted by a small token-aware router that estimates resource usage on each machine.
There's lots of optimization left there, but the systems are pinned most of the time, so I'm not focused on that at the moment; the GPUs are the bottleneck, not the queuing.
I would like to hear more about your setup if you're willing. Is the token-aware router you're using publicly available, or something you've written yourself?
It isn't open... but drop me an email and I can send it to you. Basically it just tracks a list of known LM Studio instances on the network, queries their models every 15 seconds, and routes to the ones that have the requested model loaded, using a FIFO queue that tracks tokens per model (my servers are uniform M4 Max 128GB Mac Studios, but it could also track per server), sending each request to the one that has just finished. I used to have it queue the next request just as one was expected to finish, but I was facing timeout issues due to an edge case.
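For anyone curious what that looks like in practice, here is a minimal sketch of the same idea in Python. The host names, port, and "least outstanding tokens" heuristic are my own assumptions, not the actual router; LM Studio does serve an OpenAI-compatible /v1/models endpoint by default:

    # Sketch of a token-aware FIFO router over several LM Studio hosts.
    # Hosts, polling interval, and the load heuristic are illustrative assumptions.
    import threading, time, requests

    HOSTS = ["http://studio-1:1234", "http://studio-2:1234"]  # hypothetical hosts
    loaded = {}     # host -> set of model ids currently loaded
    in_flight = {}  # host -> estimated tokens queued or running
    lock = threading.Lock()

    def poll_models():
        while True:
            for host in HOSTS:
                try:
                    r = requests.get(f"{host}/v1/models", timeout=2)
                    ids = {m["id"] for m in r.json().get("data", [])}
                except requests.RequestException:
                    ids = set()
                with lock:
                    loaded[host] = ids
            time.sleep(15)  # re-query each server's loaded models every 15 seconds

    def pick_host(model, est_tokens):
        # Route to an eligible host with the least estimated work outstanding,
        # i.e. roughly "the one that has just finished".
        with lock:
            eligible = [h for h in HOSTS if model in loaded.get(h, set())]
            if not eligible:
                raise RuntimeError(f"no host has {model} loaded")
            host = min(eligible, key=lambda h: in_flight.get(h, 0))
            in_flight[host] = in_flight.get(host, 0) + est_tokens
            return host

    def mark_done(host, est_tokens):
        with lock:
            in_flight[host] = max(0, in_flight.get(host, 0) - est_tokens)

    threading.Thread(target=poll_models, daemon=True).start()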
This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4. Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it.
Categorically different? Sure. A valid excuse to ban certain forms of linear algebra? No.
And before someone says it's reductive to call it just numbers: you could have made the same argument in favor of cryptographic export controls, that the harm cryptography does is larger than the benefit. Yet in hindsight the benefit was clearly worth it.
This is almost certainly the issue. It's very unintuitive for users, but LLMs behave much better when you clear the context often. I run /clear every third message or so with Claude Code to avoid context rot. Anthropic describes this a bit in their best practices guide [0].
This'd be a valid analogy if all compiled / interpreted languages were like INTERCAL and, e.g., refused to compile / execute programs that were insufficiently polite, or if the runtime wouldn't print out strings that it "felt" were too silly.
It depends on which vantage point you look at it from. Imagine the person directing the company, say Bill Gates, instructing that the code should be bug-free, while Microsoft itself is very opinionated about what counts as a bug.
> I don't know what it is, but trying to coax my goddamn tooling into doing what I want is not why I got into this field.
I can understand that, but as long as the tooling is still faster than doing it manually, that's the world we live in. Slower ways to 'craft' software are a hobby, not a profession.
(I'm glad I'm in it for building stuff, not for coding - I love the productivity gains).
Computer use is the most important AI benchmark to watch if you're trying to forecast labor-market impact. You're right, there are much more effective ways for ML/AI systems to accomplish tasks on the computer. But they all have to be hand-crafted for each task. Solving the general case is more scalable.
Not the current benchmarks, no. The demos in this post are so slow. Between writing the prompt, waiting a long time and checking the work I’d just rather do it myself.
For instance: I do periodic database-level backups of a very closed-source system at work. It doesn't take much of my time, but it's annoying in its simplicity: Run this GUI Windows program, click these things, select this folder, and push the go button. The backup takes as long as it takes, and then I look for obvious signs of either completion or error on the screen sometime later.
With something like this "Computer Use" model, I can automate that process.
It doesn't matter to anyone at all whether it takes 30 seconds or 30 minutes to walk through the steps: It can be done while I'm asleep or on vacation or whatever.
I can keep tabs on it with some combination of manual and automatic review, just like I would be doing if I hired a real human to do this job on my behalf.
(Yeah, yeah. There's tons of other ways to back up and restore computer data. But this is the One, True Way that is recoverable on a blank slate in a fashion that is supported by the manufacturer. I don't get to go off-script and invent a new method here.
But a screen-reading button-clicker? Sure. I can jive with that and keep an eye on it from time to time, just as I would be doing if I hired a person to do it for me.)
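To make that concrete, a request against Anthropic's computer-use beta looks roughly like the sketch below. The model id, tool version string, beta flag, and display size are assumptions from memory of the docs, and a real agent loop also has to execute each action the model asks for (screenshot, click, type) and send the result back:

    # Rough sketch against Anthropic's computer-use beta. The tool/version
    # strings and model id are assumptions; check the current docs before use.
    import anthropic

    client = anthropic.Anthropic()

    response = client.beta.messages.create(
        model="claude-sonnet-4-5",          # assumed model alias
        max_tokens=1024,
        tools=[{
            "type": "computer_20250124",    # assumed computer-use tool version
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        }],
        messages=[{
            "role": "user",
            "content": "Open the backup tool, select the backups folder, "
                       "start the backup, and report whether it shows success or an error.",
        }],
        betas=["computer-use-2025-01-24"],  # assumed beta flag
    )

    # A real agent loop would now execute each tool_use block the model returns
    # (take a screenshot, click, type), send the results back as tool_result
    # messages, and repeat until the model stops requesting actions.
    print(response.content)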
Have you tried AutoHotKey for that? It can do GUI automation. Not an LLM, but you can pre-record mouse movements and clicks, I've used it a ton to automate old windows apps
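If it helps, the same pre-recorded-clicks idea is easy to sketch from Python with pyautogui instead of AutoHotKey. Every coordinate, delay, and the "done" screenshot name below is a placeholder for whatever the real backup GUI needs:

    # Sketch of the recorded-clicks approach with pyautogui.
    # All coordinates, delays, and the screenshot name are placeholders.
    import time
    import pyautogui

    pyautogui.PAUSE = 0.5        # small pause between actions
    pyautogui.FAILSAFE = True    # slam the mouse into a screen corner to abort

    def run_backup():
        pyautogui.click(120, 740)               # taskbar shortcut for the backup tool
        time.sleep(5)                           # wait for the window to appear
        pyautogui.click(300, 200)               # "Backup" tab
        pyautogui.click(450, 320)               # destination-folder field
        pyautogui.typewrite(r"D:\backups", interval=0.05)
        pyautogui.click(500, 600)               # the "go" button
        for _ in range(240):                    # poll up to ~2 hours, every 30 s
            try:
                # locateOnScreen's confidence= needs opencv-python installed
                if pyautogui.locateOnScreen("backup_done.png", confidence=0.9):
                    return True
            except pyautogui.ImageNotFoundException:
                pass
            time.sleep(30)
        return False

    if __name__ == "__main__":
        run_backup()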
I've tried AutoHotKey previously, and I've also given up on it. I may try it again at some point.
It is worth noting that I am terrible at writing anything resembling "code" on my own. I can generally read it and follow it and understand how it does what it does, why it does that thing, and often spot when it does something that is either very stupid or very clever (or sometimes both), but producing it on a blank canvas has always been something of a quagmire from which I have been unable to escape once I tread into it.
But I can think through abstract processes of various complexities in tiny little steps, and I can also describe those steps very well in English.
Thus, it is without any sense of regret or shame that I say that the LLM era has been a boon for me in terms of the things I've been able to accomplish with a computer...and that it is primarily the natural-language instructional input of this LLM "Computer Use" model that I find rather enticing.
(I'd connect the dots and use the fluencies I do have to get the bot to write a functional AHK script, but that sounds like more work than the reward of solving this periodic annoyance is worth.)
I'm really interested in the progress on computer use. These are the benchmarks to watch if you want to forecast economic disruption, IMO. Mastery of computer use takes us out of the paradigm of task-specific integrations with AI to a more generic interface that's way more scalable.
Maybe this is true? But it's not clear to me this methodology will ever be quite as good as native tool calling. Or maybe I don't know the benchmark well enough; I just assume it's vision-based.
Perhaps Tesla FSD is a similar example: in principle, self-driving with vision alone should be possible (humans manage it), but it's fundamentally harder and more error-prone than having better data. It seems very error-prone and expensive in tokens to use computer screens as the fundamental unit.
At the same time, I'm sure there are many tasks that could be automated this way, so shrug.
Do you think a Genie-like model specifically trained on data consisting of interactions with application interfaces would be good at computer-use tasks?
Thus far I haven't had to worry about ChatGPT having bad incentives when giving me advice on product purchases. Now that "Merchants pay a small fee on completed purchases", will the model steer me towards ACP-supported retailers at a higher rate?
Presumably yes, but if I’m using an agent to make purchases, I’d prefer it to use sites where it can safely make a purchase anyway. The optimal UI would probably leave it up to me: one system should find products independently of whether their vendor supports “ACP,” and I should be able to configure my agent to “only purchase where ACP is supported.” In the context of a conversation, I’d expect it to show me all the products it found, and offer to refine the list to include only ACP vendors.