
Has anybody else noticed a pretty significant shift in sentiment when discussing Claude/Codex with other engineers since even just a few months ago? Specifically because of the secret/hidden nature of these changes.

I keep getting the sense that people feel like they have no idea if they are getting the product that they originally paid for, or something much weaker, and this sentiment seems to be constantly spreading. Like when I hear Anthropic mentioned in the past few weeks, it's almost always in some negative context.


Well, off the top of my head:

- Banning OpenClaw users (within their rights, of course, but bad optics)

- Banning 3rd party harnesses in general (ditto)

(claude -p still works on the sub but I get the feeling like if I actually use it, I'll get my Anthropic acct. nuked. Would be great to get some clarity on this. If I invoke it from my Telegram bot, is that an unauthorized 3rd party harness?)

- Lowering reasoning effort (and then showing up here saying "we'll try to make sure the most valuable customers get the non-gimped experience" (paraphrasing slightly xD))

- Massively reduced usage (apparently a bug?). The other day I got 21x more usage spend on the same task for Claude vs Codex.

- Noticed a very sharp drop in response length in the Claude app. Asked Claude about it and it mentioned several things in the system prompt related to reduced reasoning effort, keeping responses as brief as possible, etc.

It's all circumstantial but everything points towards "desperately trying to cut costs".

I love Claude and I won't be switching any time soon (though with the usage limits I'm increasingly using Codex for coding), but it's getting hard to recommend it to friends lately. I told a friend "it was the best option, until about two weeks ago..." Now it's up in the air.


> It's all circumstantial but everything points towards "desperately trying to cut costs".

I have been wondering if it's more geared toward reducing resource usage, given that at the moment there's a known constraint on AI datacenter expansion capacity. Perhaps they are struggling to meet demand?


It’s more that Anthropic knows that the models themselves are non-sticky, and that the real moat is in the ecosystem around them.

It only makes sense for them to get users to use their ecosystem, rather than other tools.


See: Claude Cowork trying to establish an entirely new group of people in their ecosystem.

And massive VM drive growth

> Perhaps Anthropic is struggling to meet demand?

Yes, definitely, they’re gracefully failing to meet demand. They could also deny new customers, but it would probably be bad for business.


I once decided to deny new customers in order to be able to service current demand at the quality we wanted. It backfired and made people want our product even more. Our phones were blowing up. That approach can have unintended consequences!

You unintentionally used a common sales tactic; by decreasing supply you increase demand.

Another knob you could have turned is: raise prices. Did you try this?

Anthropic is already doing this.

Signup prices seem higher now than three months ago.

This is actually the least frustrating method because people who can't afford to pay are not as angry as people who paid and aren't getting served (like when sign-in emails don't arrive for hours or days), or people who have paid for a long time only to suddenly see quality decrease.

But it might not be best for business: Having more users than you can handle might suck, but if you're popular enough, people are still gonna put up with it.


Bad for business and probably unwise for the type of product people will pop their head in to check on, then stop paying and return much later to see whether it's still not much more than a parlor trick for them.

For my part, I've tried to help reduce their demand by cancelling my subscription.

I wish they would just rip the bandaid off to stop everybody's entitled whining.

"We're sorry, what we were able to give you for $100/mo before now needs to be $200/mo (or more). We miscalculated/we were too generous/gave too much away for too little. It's a new technology, we are seeing a ton of demand, we are trying to run a business, hope you understand. If you don't want it, don't pay for it."


This is my take too, although I'm not prepared for a max400 reality to replace the max200, but... I hate all of the whingeing. Piggies at the buffet line seem to be the loudest on this subject.

I would understand the move, but boy would it play right into the "AI is only here to make the rich even richer" feeling, wouldn't it?

If I strain really hard, I can come up with a reason why it might play into such a narrative.

/s


> "We're sorry, what we were able to give you for $100/mo before now needs to be $200/mo (or more). We miscalculated/we were too generous/gave too much away for too little. It's a new technology, we are seeing a ton of demand, we are trying to run a business, hope you understand. If you don't want it, don't pay for it."

Anthropic's thing has always been that they are perceived as slightly ahead of the competition; if they 2X their pricing, then the competition that used to be "slightly worse" suddenly becomes an absolute bargain and guts their user base.


It is one thing to pay 100 a month to make calendar apps for your linkedin and birds on bicycles to get invited to talks, paying 200 HOWEVER

If we didn’t have the birds on bicycles, how would we know the models are getting better?

Are we at the point where there are external constraints that cash can't solve?

Can't tell if you're being facetious, but yes, there's not enough cash in the world to double energy/silicon fab capacity in a year. Infrastructure takes time, hardware is hard, and you have to be willing to bet that the demand will still be there 5 years from now to make an investment today.

Until one has the entire world supply of GPU production, cash can solve it by outbidding others.

TSMC would never allocate all of their output to only one customer. You have an oversimplified view of this.

One could always make existing infrastructure more efficient. Nothing better than post-mature optimization.

Just put everyone on pay per use with the API and rip the band aid off.

Even the pay per use is heavily VC-subsidized at current prices.

All indications are that inference for API use is margin-positive for OpenAI and Anthropic, not the subscriptions.

It will basically cut the hobbyist out and entrench large corporations that can pay the real costs.

If that happened and I was working for myself, I would just buy the beefiest computer I could finance and do everything locally.


Honestly, I wish they couldn't subsidize with VC cash and offer below cost to begin with. Like, I wish it were illegal. Basically this allows things like Uber, more or less putting taxis out of business and then being worse than what they replaced.

I’d like to see a lot more than entitled whining. I would like to see the fist of regulation slammed down on the back of these tech shenanigans, where they know they'll never be able to match the prices they're starting with.


I wish they would too. I’d respect them more for the transparency. I think everyone’s enshittification sensors have rightly been dialed up over the years, so without explanations for the regressions it just feels like another example.

claude -p is allowed. They're not going to give you a feature and then ban you for using it.

What they changed is that it now uses extra usage, which is charged at API rates.


"claude -p" does not charge api rates by itself, I just ran "claude -p 'write hello world to foo.txt'", and it didn't.

What they changed is that if you have OpenClaw run 'claude -p' for you, that gets your account banned or charged API rates, and if they think your usage of 'claude -p' is maybe OpenClaw, even if it's not, you get charged API rates or banned.

It seems so silly to me. They built a feature with one billing rate, and the feature is a bash command. If you have a bad program run the bash command, you get billed at a different rate; if you have a good script you wrote yourself run it, you're fine. But they have literally no legitimate way to tell the difference, since either way it's just a command being run.

The justification going around is that OpenClaw usage is so heavy that it impacts the service for other people, but like OpenClaw was just using the "claude code max" plan, so if they can't handle the usage the plan promises, they should be changing the plan.

If they had instead said "Your claude code max plan, which has XX quota, will get charged API rates if you consistently use 50% of your quota. The quota is actually a lie, it's just the amount you can burst up to once or twice a week, but definitely not every day" and just banned everyone that used claude code a lot, I wouldn't be complaining as much, that'd be much more consistent.


It only switches to charging API rates if some part of your prompt triggers their magic string detector. Lots of examples of that floating around, where swapping "is" for "are" or whatever will magically allow the request against your subscription plan again.

> (claude -p still works on the sub but I get the feeling like if I actually use it, I'll get my Anthropic acct. nuked. Would be great to get some clarity on this. If I invoke it from my Telegram bot, is that an unauthorized 3rd party harness?)

How often? Realistically, if you invoke it occasionally, for what's clearly an amount that's "reasonable personal use", then no, you don't get nuked.


It’s the same problem people have with Google. If they ban you for some AI hallucinated reason you have no recourse other than going viral on Hacker News.

I haven't seen a single case of that happening with Anthropic yet. Every time someone has gotten banned, it's because they either used third party harnesses which went to great lengths to impersonate Claude Code (obvious evasion), or set things up so it maxed out their usage 24/7.

I'll change my mind when I see otherwise.

And this isn't being positive about Anthropic support or their treatment of users, as I too have seen lots of people here getting billed by them for stuff they never signed up for, blatant fraud. That's even worse than Google. I'm only talking about getting banned for usage.


I plugged this question into Claude and told it to limit the list to 10:

1. Cancer patient banned mid-payment: https://news.ycombinator.com/item?id=46675740

2. Hobbyist coder, VPN trigger, forms into void for 10+ months: https://news.ycombinator.com/item?id=47286867

3. "Reinstated" but still locked out (two systems out of sync): https://news.ycombinator.com/item?id=46007408

4. Banned for testing vision API: https://news.ycombinator.com/item?id=39988137

5. Banned on first ever prompt ("What do you know about Hacker News?"): https://news.ycombinator.com/item?id=39698788

6. Mass banning wave, some banned before first use: https://news.ycombinator.com/item?id=39672765

7. Entire company banned without warning, thousands of users stranded: https://news.ycombinator.com/item?id=42210199

8. Forced new account (no email change support) → immediately banned: https://news.ycombinator.com/item?id=46339741

9. Banned for scaffolding a Claude.md file, support email never arrives: https://news.ycombinator.com/item?id=46723384

10. $81 billing overcharge, human promised, month of silence: https://news.ycombinator.com/item?id=47693679


Would've been nice if you'd read them, as not a single one of them even mentions Agent SDK or claude -p usage, the topic of this thread.

I did read them, but I interpreted the topic of this thread to be Anthropic's vague approach to compliance enforcement, not specifically how claude -p is used and interpreted by Anthropic.

Huh, I've noticed that! Opus hallucinates answers and provides contradictory reasoning to me much more regularly during the past couple of weeks than it did in February.

For example, after a question about a script that used taskset and CUDA_VISIBLE_DEVICES to handle NUMA domains on a parallel GLUE evaluation run:

>Round 2 tasks reuse GPU slots that Round 1 may still be occupying. Cola, SST-2, and QQP are small datasets and will likely finish before MNLI, but QNLI is sizable. If QNLI (GPU 3, slot 3) is still running when Round 2 launches, that's fine — they're on different slots. But more importantly, Round 2 uses slots 5/6/7 (GPUs 7/0/1), while Round 1 uses slots 0/1/2/3/4 (GPUs 4/5/2/3/6). So there's no GPU collision — you're good here actually. Disregard, this is clean.


They also screwed up the API token detection and blocked a bunch of 1st party tool users for ~24h.

Support consisted of AI bots saying you did something stupid, you did something wrong, you were abusing the system, followed by (only when I asked for it explicitly) claiming to file a ticket with a human who would contact you later (and it either didn't happen or their ticket system is /dev/null).

(By the way this is the 2nd time I've been "please hold" gaslit by support LLMs this exact same way, the other being with Square)


For what it's worth: I invoked claude -p from a script, and my account was nuked immediately. DM'd Thariq from Anthropic, who admitted it was a weird classifier and would look into it, but then he never followed up. It's been 13 days since I was banned now.

Very sad considering I got my whole company on Claude Code, only for them to just ban me like this, with no customer support response.


claude -p not working would mean an instant downgrade from Max to Pro and further drive my use of codex. I use both, but overall I've noticed I reach for Claude less than codex lately because claude keeps getting slower and slower (I have not noticed a drop-off in quality, but I use it less and less, so maybe I'm not in a good position to notice).

Generally I find codex and claude make a good team. I'm not a heavy user, but I am currently Claude Max 5x and ChatGPT Plus. Now that OpenAI has a $100 offering and I am finding myself using Claude less, I am considering switching to Claude Pro and ChatGPT Pro x5. The work hours restriction on Claude Max x5 really pisses me off.

I am not a heavy user. Historically I only break over 50% weekly one week a month and average about 30-40% of Max x5 over the entire month. I went Max because of the weekly limits and to access the better models and because I felt I was getting value. I need an occasional burst of usage, not 24/7 slow compute. But even for pay-as-you-go burst usage Anthropic's API prices are insane vs Max.

I have yet to ever hit a limit on codex so it's not on my mind. And lately it seems like Claude is likely to be having a service interruption anyway. A big part of subscribing to Claude Max was to get away from how the usage limits on Pro were causing me to architect my life around 5hr windows. And now Anthropic has brought that all back with this don't use it before 2pm bullshit. I want things ready to go when the muses strike. I'm honestly questioning whether Anthropic wants anyone who isn't employed as a software engineer to use their kit.

Anyway for the last month or so codex "just works" and Claude has been an invitation for annoyances. There was a time when codex was quite a bit behind claude-code. They have been roughly equal (different strength and weaknesses) since at least February (for me).


I might consider switching to codex from claude pro 20x, but I need the post tool use, pre file write, and post user message hooks. Waiting on codex to deliver.

- pre file write -> block editing code files without a task and plan of work

- post tool use -> show next open checkbox in the task to the agent, like an instruction pointer

- post user message -> log all user messages for periodic review of intent alignment

These 3 hooks + plain md files make my claude harness.
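For the curious, here is roughly what that wiring can look like in Claude Code's settings.json. The event names (PreToolUse, PostToolUse, UserPromptSubmit) come from Claude Code's documented hooks; the matcher values and script paths are hypothetical stand-ins for whatever your harness actually runs:

    {
      "hooks": {
        "PreToolUse": [
          { "matcher": "Write|Edit",
            "hooks": [{ "type": "command", "command": "./hooks/require-plan.sh" }] }
        ],
        "PostToolUse": [
          { "matcher": "Write|Edit",
            "hooks": [{ "type": "command", "command": "./hooks/show-next-checkbox.sh" }] }
        ],
        "UserPromptSubmit": [
          { "hooks": [{ "type": "command", "command": "./hooks/log-user-message.sh" }] }
        ]
      }
    }

(A PreToolUse hook can block the pending edit by exiting with a blocking status; exit code 2, if I remember the docs right. That's what makes the "no edits without a plan" gate possible.)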


I use codex through the pi agent. It’s wonderful and easy to create whatever extension or hook you want!

I’d use it with Claude too if they hadn’t banned it…


Why couldn’t you use the Claude Code harness with codex? The requests can be proxied to OpenAI.

I am cooking up an abstraction that enables these hooks on codex. Would love to have you kick the tires.

Try pi

Anthropic has become shady as hell in less than a few weeks. The DoD story and the overall popularity among developers got them a huge leap over OAI, but I certainly won't renew my subscription with them. The Claude SDK feels like a constant fight against its own limitations compared to Codex and other harnesses.

Why were third party harnesses banned? Surely they'd want sticking power over the ecosystem.

There’s the argument that Anthropic has built Claude Code to use the models efficiently, which the subscription pricing is based on.

Maybe there’s some truth to that, but then why haven’t OpenAI made the same move? I believe the main reason is platform control. Anthropic can’t survive as a pipeline for tokens, they need to build and control a platform, which means aggressively locking out everybody else building a platform.


Alternatively, products like OpenClaw have an outsized impact on Anthropic's infrastructure for essentially no benefit to them. Especially when you're taking advantage of the $200 plan.

OpenAI has never shied away from burning mountains of cash to try and capture a little more market share. They paid a billion dollars for a vibe-coded mess just for the opportunity to associate themselves with the hype.


> Taking advantage of the $200 plan.

No, I'm paying $200 a month for a premium product that I expect premium service for. It's the single most expensive IT expense I have. Taking advantage my foot.


Can you imagine paying the actual cost of it, or a subscription cost that at least ballpark matched it? I don't think I have a single friend or acquaintance who realistically would.

You are simply a bit too entitled. It's not a premium product and honestly not that expensive in my opinion either (though that is going to depend on your location).

You are more than able to pay for API rates.


You may want to learn the difference between someone being able to pay API rates and someone willing to pay API rates. I'm sure many people on HN are able to pay API rates, and almost all of them aren't willing to. The providers know this, which is why subscriptions exist. The API is almost solely used by companies, as almost no private individual would be willing to pay those rates.

“You may want to learn” is such a choice way to introduce your position, which is really not much of one.

If you are going to come and complain about a $200 subscription that gives you $400 worth of API tokens, there is only so much room to complain. Only so many lemons can be squeezed. Hope that was helpful for you.


A normal person pays $0-10 for an AI plan, maybe double that for a business.

$200 is premium.


It is not a premium service; it simply buys you more tokens. That $200 gives you at least $400 in tokens at API prices.

Don't confuse price with "premium service". It was not that long ago that folks would be spending $100-200 on their cable service bundle. You are buying a subsidized product when using the plan, and the more you spend the more tokens you get; it has nothing to do with being a premium service.


This is a messaging issue on their part, which I think is partially intentional.

It’s not unreasonable for people to expect the most expensive subscription plan to be “premium”. That’s how it works everywhere else. They typically have better margins on the premium plans, and the monthly payment gives them reliable cash flow at that higher margin.

You’re right that that’s not true at Anthropic (or really most AI providers). You’re not even really buying tokens because you get billed whether you use it or not, the tokens don’t carry over like buying API tokens, and they get to dictate what an acceptable way to use those tokens is. They are cheaper though, assuming you actually use them. Which Anthropic et al would really prefer you didn’t.


> It is not a premium service, it simply is buying you more tokens.

The cheap plans are usually semi-unlimited the same way but not as powerful. This isn't simply a matter of buying more tokens.

> It was not that long ago that folks would be spending $100-200 on their cable service bundle.

Compared to OTA that's premium, but more relevantly, if most cable buyers are getting a hypothetical $10 bundle, then the $100 one is a premium bundle.


Sorry, still not sure what you're going on about. The majority of LLM plans are simply a token purchase. The $200 account buys you nothing but tokens. It's not a premium service, it's simply more tokens. This is true for most of the companies out there.

The original comment was that they are paying for a premium service. No, they are paying for more tokens. You lot keep going on and on, arguing over some small hill.


The lower tier openai and google plans don't have access to the same models. Where are you seeing popular plans that are simply token purchases?

I guess if you want to go that deep, sure, they sometimes offer early access or access to new agents/models, but ultimately it's a function of tokens. The selling point for most/all providers is x times the usage. You are upgrading for the token access.

Claude was the topic at hand, and higher tiers buy you more tokens. I know some like Gemini bundle a ton of junk alongside the tokens, but you really are still buying yourself more tokens. There is nothing premium in a $200 Claude account. You are buying more tokens; $100 is the same as $200 except for token count. Hope that helps. ;)


> $100 is the same as $200 except token count

But I was making an argument about the $10 plans, not the $100 plans.

Claude doesn't even go that low. Except the free plan which has a very reduced feature list.

Claude's $20 and $100 are pretty similar except for tokens, that part is true. So they're a bit higher priced and more of the "it's just tokens" model. But the market as a whole is mostly selling a limited feature set down at lower price points. On average, getting up to the point where you have full access and are paying per-token is itself a premium jump.


You are standing on top of an ant hill and I still don’t fully understand your position. The original post was about the premium service Anthropic plans. There is no such thing, you are simply paying for more tokens. Hope that helps.

Any $200/month AI plan is premium. Hope that helps.

I know why I typically don’t respond to your posts. So much said, and I am still not sure of your point. You have ignored the original point and gone off on a tangent.

It is not a premium service that deserves special care, which is what the original commenter stated it was. It is a $200 account that buys you $400+ in tokens.

Hope that helps recenter this weird path we are following. :)


> gone off on a tangent

What? What I just said was my one and only point from the very beginning. The price is so much higher than the median that that makes it premium and deserving of some special care.

I understand your point of view here, and it's fine if you disagree with mine but it's weird if you don't at least understand my point by now. You saying my last comment is a tangent suggests you don't understand me. But it's a simple point and I'm not sure how to make it clearer.

Does that help recenter?


It's not a premium product. It's just expensive.

> They paid a billion dollars for a vibe coded mess just for the opportunity to associate themselves with the hype.

Lol no they didn't. It wasn't even an acquihire. They just hired Peter.

Maybe they are paying him incredibly well, but not a billion dollars well.


I think it's a training data thing. They can only gather valid training data from real human interactions, so they don't want to subsidize tokens for purely automated interactions.

> Why were third party harnesses banned? Surely they'd want sticking power over the ecosystem.

Third-party harnesses are the exact opposite of stickiness!

Ditching Claude Code for a third party harness while using the Claude Code subscription means it's trivial to switch to a different model when you {run out of credits | find a cheaper token provider | find a better model}.


Note that the thing that's banned is using third party harnesses with their subscription based pricing.

If you're paying normal API prices they'll happily let you use whatever harness you want.


To be clear, they weren’t banned from Claude usage; they were required to use the API and API rates rather than Claude Max tokens.

Claude Code uses a bunch of best practices to maximize cache hit rate. Third party harnesses are hit or miss, so they often use a lot more tokens for the same task.


Nah, this doesn't explain it.

Most of the users of those third party harnesses care just as much about hitting cache and getting more usage.


I'm watching a conference talk right now from 2 weeks ago: "I Hated Every Coding Agent So I Built My Own - Mario Zechner (Pi)", and in the middle he directly references this.

He demonstrates in the code that OpenCode aggressively trims context by compacting on every turn and pruning all tool calls from the context that occurred more than 40,000 tokens ago. Seems like it could be a good strategy to squeeze more out of the context window, but by editing the oldest context, it breaks the prompt cache for the entire conversation. There is effectively no caching happening at all.

https://youtu.be/Dli5slNaJu0
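To make the caching point concrete, here's a toy sketch (mine, not from the talk). Provider-side prompt caches key on an exact prefix of the request, so any change to the oldest blocks invalidates everything after them:

    # Toy model of prefix caching: only an unchanged leading run of
    # context blocks can be served from cache on the next request.
    def cached_prefix_len(prev_request: list[str], new_request: list[str]) -> int:
        n = 0
        for a, b in zip(prev_request, new_request):
            if a != b:
                break
            n += 1
        return n

    turn1 = ["system", "user: fix bug", "tool: read foo.py", "assistant: done"]

    # Appending only (Claude Code style): old blocks untouched, full cache hit.
    turn2_append = turn1 + ["user: now add tests"]
    print(cached_prefix_len(turn1, turn2_append))  # 4 -> all prior blocks reused

    # Pruning old tool calls (aggressive trimming): an early block changes,
    # so the shared prefix collapses and the whole tail is re-processed.
    turn2_pruned = ["system", "user: fix bug", "assistant: done", "user: now add tests"]
    print(cached_prefix_len(turn1, turn2_pruned))  # 2 -> cache mostly useless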


Sure. The question is whether they have the same level of expertise and prioritization that Anthropic does.

They are working with the same tools and knowledge as Anthropic, since caching practices are documented. And they have as much incentive as Anthropic does to not waste compute. Can we stop acting like the people who build harnesses, be it OpenCode or Mario Zechner's Pi, are dumbfucks who don't understand caching?

But claude -p is still Claude Code.

Has something using that been banned?

Yep, that's the reason for the new Extra Credit feature in Claude Code. Some people were wiring up "claude -p" with OpenClaw, so now Anthropic detects if the system prompt contains the phrase OpenClaw and bills from Extra Credit if that happens:

https://x.com/steipete/status/2040811558427648357

"Anthropic now blocks first-party harness use too

claude -p --append-system-prompt 'A personal assistant running inside OpenClaw.' 'is clawd here?'

→ 400 Third-party apps now draw from your extra usage, not your plan limits.

So yeah: bring your own coin "


https://xcancel.com/bcherny/status/2041035127430754686#m

> This is not intentional, likely an overactive abuse classifier. Looking, and working on clarifying the policy going forward.


One thing is lack of control of token efficiency on what’s already a subsidised product.

Another thing is branding: Their CLI might be the best right now, but tech debt says it won’t continue to be for very long.

By enforcing the CLI you enforce the brand value — you’re not just buying the engine.


Claude Code was the best harness from roughly around release to January this year. Ever since then, it's become more and more bloated with more and more stuff and seemingly no coherent plan or vision to it all other than "let's see what else that sounds cool we can cram in there."

What's taken over since then? Codex or something else?

Pi.dev

Maybe they should fix bugs like this, then: https://github.com/anthropics/claude-code/issues/17979#issue... ...

I want to differentiate two kinds of harnesses:

1. OpenClaw-like: using the LLM endpoint on subscription billing, with different prompts than Claude Code

2. Using the claude CLI with -p, in headless mode

The second runs through their code and prompts, and just calls claude in non-interactive mode for subtasks. I feel especially put off by restricting the second kind. I need it to run judge agents to review plans and code; a sketch of that pattern is below.
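For example, the judge-agent pattern in the second kind is nothing more than a headless one-shot through their own CLI (the -p and --output-format flags are Claude Code's documented headless-mode flags as far as I know; the file names here are made up):

    # One-shot "judge" pass over a plan, run non-interactively through
    # the first-party CLI; the verdict is captured as JSON for review.
    claude -p "Review PLAN.md against REQUIREMENTS.md and list concrete risks" \
      --output-format json > judge-review.json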


> (claude -p still works on the sub but I get the feeling like if I actually use it, I'll get my Anthropic acct. nuked. Would be great to get some clarity on this. If I invoke it from my Telegram bot, is that an unauthorized 3rd party harness?)

100% this; I’ve posted the same sentiment here on HN. I hate the chilling effect of the bans and the lack of clarity on what is and is not allowed.


In this case, they handled things pretty well. You can still use OpenClaw etc. with your regular Anthropic subscription, it will just count towards your extra credits / usage, which you can buy at a 30% discount compared to API pricing. And they gave everyone one month’s value in credits.

I don’t think they could have done that much better I’d say.


That does not address joshstrange's concerns.

There is very poor clarity about what is and isn't allowed with the Claude SDK/claude -p. Are we allowed to use it to automate stuff? What kinds of tasks is it permitted to be used for? What if you call your script 'OrangeClaw' and release that on GitHub? What if your script gets super popular; does it suddenly become against the ToS?


This is exactly my point. At what point does it become a ToS violation? Right now it's a huge grey area and the idea of getting my account banned because I crossed an invisible line with zero recourse other than to switch providers is... frustrating.

It's pretty easy to read between the lines, tbh. Personal, non-automated use is fine. Using it as a means to automate depleting your 5-hour limit 24/7 ("leftover usage") is not fine. They don't want to put it in the ToS because it's almost impossible: writing what I just said will still have people going "well, what's automated, where's the exact line!" when it's all pretty clear what the intended use case here is. The Anthropic peeps have said about as much.

I get that the traditional dev is allergic to the concept of reading between the lines and demands everything to be spelled out explicitly, but maybe you should just see it as something to learn because it's an incredibly useful life skill.


Ok, let's say I'm not using it to deplete leftover usage; the task just happens to run down the 5 hour window usage.

Are you willing to bet your account over whether you've read between the lines correctly? Anthropic aren't going to listen to appeals.


> the task just happens to run down the 5 hour window usage.

In a single prompt? From zero usage? That doesn't "just happen".


When you're using the SDK, yes it can. Example: I used the Python SDK to translate a bunch of source code recently. I spawned a subagent for each module that needed translating and left it to run for a few hours with a parallelism limit of 5. It blasted through the 5 hour usage and dug into extra usage credits.

I have zero assurances that the above can't result in a ban. The usage pattern is not distinct from OpenClaw.
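The setup was only a few lines, too. A rough sketch of it, assuming the claude-agent-sdk Python package (the query entry point is from its docs, but treat the exact names as approximate; the module list is a placeholder):

    # Fan out one translation subagent per module, at most 5 in flight.
    import asyncio
    from claude_agent_sdk import query

    MODULES = ["parser.ts", "lexer.ts", "emitter.ts"]  # placeholder file list
    LIMIT = asyncio.Semaphore(5)  # the parallelism cap from my run

    async def translate(module: str) -> None:
        # The semaphore keeps at most 5 subagents running at any time.
        async with LIMIT:
            async for _message in query(prompt=f"Translate {module} per MIGRATION.md"):
                pass  # stream and discard; the real script logged and verified

    async def main() -> None:
        await asyncio.gather(*(translate(m) for m in MODULES))

    asyncio.run(main())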


As I said, it doesn't just happen; you explicitly had to set it up so it could happen.

I'm confused about this comment.

The GP has described a task which feels very well within the intended usage of CC, but which can easily eat up the usage limit.

What should we read between the lines about this scenario?

Is it a bannable offense?


Just in case it wasn't clear, what they described doesn't need extra tooling. You can write this in your CLI and it will easily cap a Max 20x plan in an hour: "we are converting this entire codebase from TS to C#. Following the guidelines I've written in MIGRATION.md, convert each file individually. Use up to 32 parallel subagents. Track your work for each file in a PROGRESS.md file, which you will update for each file starting and completing. Using an agent team, as a secondary step, add a verification layer where you verify each file individually for accurate migration following the instructions in VERIFICATION.md"

Yeah, there are other ways to do this (you can set up a separate harness, sure, to make it more efficient), but just the above will also work. It's just text you paste into your CC terminal, and it will absolutely cap the largest subscription plan available, no problem.


That "non-automated" part is where I feel like there is a lack of clarity. They even have some stuff in to allow for scheduling in Claude Code. Seems similar to a cron but "non-automated" would rule out using a cron (right?). I'd love to feel comfortable setting up daily/hourly tasks for Claude Code but that feels iffy. Like I said, I don't think the line is clear.

The lack of clarity doesn't matter because they obviously can't tell if you ran a claude -p a few times today with usual prompts or whether your cron job did. It's impossible for them to reliably tell.

They can tell if your cron is running them every 10 minutes 24/7, because basic biology rules out you doing that for more than a day or so.


Wait, this is news to me. I thought 3rd party use of the sub was unequivocally prohibited?

If I'm understanding you correctly: they changed that policy, you can now use 3rd party software unofficially with the undocumented Claude Code endpoint, and their servers auto-detect this and charge you extra for it?

EDIT: Yeah, something like that?

> Starting April 4 at 12pm PT / 8pm BST, you’ll no longer be able to use your Claude subscription limits for third-party harnesses including OpenClaw. Instead, they’ll require extra usage.

https://news.ycombinator.com/item?id=47633568

This seems to mean that unauthorized usage of the sub endpoint is tolerated now (and billed as though it were the regular API). And possibly affects claude -p, though I don't know yet.


> If I'm understanding you correctly: they changed that policy, you can now use 3rd party software unofficially with the undocumented Claude Code endpoint, and their servers auto-detect this and charge you extra for it?

That’s correct. It’s more like a convenience technicality: you can use your sub account, but you’re paying extra. So it doesn’t really count towards your subscription in any way.

Subscribers can buy extra credits at a 30% discount, though, so it's a decent amount cheaper than the actual API, but still prohibitively expensive.


One month’s value in credits does not equal the value of one month’s subscription. They could have done better.

Perhaps Anthropic should put a freeze on new signups until they can increase capacity. This is the best kind of problem for a business; I'm cheering for them.

If there is one thing that is crystal clear, it's that LLM providers will always take your money, no matter how bad the service is.

This requires ethics.

I think we are about a month away from a class action lawsuit; at their revenue they are a juicy target. And god knows they've got the entirely self-inflicted unholy combination going on: marketing & sales that border on fraud (X times the usage of plan Y, which has Z times the free tier, which has unknowable "magic tokens"), and then of course the actual fraud, reducing usage in fifteen different non-obvious, non-public ways.

I will say I have noticed none of these things in my enterprise account. Is this a known targeting of non-enterprise clients only?

>> apparently a bug?

It's a bug only if they get a harsh public response; otherwise it becomes a feature.


A bug for one side can be a feature for another.

I don't know why people are surprised. You just need to see what they say on China, open source, and their fake safety blogs to understand they're not a company that devs should give their code to for free.

> claude -p still works on the sub but I get the feeling like if I actually use it, I'll get my Anthropic acct. nuked

I've used it with a sub a lot. Concurrency of 40 writing descriptions of thousands of images, running for hours on sonnet.

I have a lot of complaints. I've cancelled my $200 subscription and when it runs out in a few days I'll have to find something else.

But claude -p is fine.

... Or it was 2 weeks ago. Who knows if they've silently throttled it by now?


The other day I read that letting another agent invoke claude -p was considered a violation (i.e. letting OpenClaw delegate to Claude Code).

Not sure how that's enforced though. I was in OpenClaw discord a while ago and enforcement seemed a bit random.

I'll try to find the source, I might have gotten the details mixed up.


It’s not a “violation” but they said it would be charged as extra usage.

This is a funny cat and mouse game. They offer a built in loop command.

Just tmux and use that.

Soon, if they drop -p, people will just vibe code in 5 minutes a way to type inside it remotely, similar to their own built-in remote access tool. Seems like a losing game from Anthropic's side.
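The tmux version is already trivial; this is plain tmux scripting, nothing Claude-specific beyond the command name:

    # Start claude in a detached session, then "type" into it remotely.
    tmux new-session -d -s claude-loop claude
    tmux send-keys -t claude-loop "continue with the next task" Enter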


Most of those issues are coming from a very small minority. A lot of times it's good for businesses to focus on the customers that are driving the highest margin, most likely not users like yourself.

1) Nobody should expect to use OpenClaw without API usage.

2) We have known for a long time that the plans are subsidized. It was not as big of a deal before, but now that demand has continued to explode at a multiple and tools like OpenClaw were creating a lot of usage from a small minority of customers, prices change.

Everything for me points more towards: we have made a service people really want to use, and we are trying to balance a supply shortage (compute) with pricing. Nothing is stopping folks like yourself from simply paying the API rates. It is the simple, no-hassle way to get around any issue you are having: pay the API cost and you will have no limitations!


A month ago the company I work at, with over 400 engineers, decided to cancel all IDE subscriptions (Visual Studio, JetBrains, Windsurf, etc.) and move everyone over to Claude Code as a "cost-saving measure" (along with firing a bunch of test engineers). There was no migration plan; the EVP of Technology just gave a demo showing 2 greenfield projects he'd built with Claude Opus over a weekend and told everyone to copy how he worked. A week later the EVP had to send out an email telling people to stop using Opus because they were burning through too many tokens.

Claude seems to be getting nerfed every week since we've switched. I wonder how our EVP is feeling now.


Pretty bad decision on his part. I've been telling other engineers within my company who felt threatened by AI that this would happen. That prices would rise and the marginal cost for changes to big codebases would start to exceed the cost of an engineer's salary. API credits are expensive, especially for huge contexts, and sometimes the model will use $200 in credits trying to solve a problem that could be fixed in an hour by a good engineer with enough context.

It kind of reminds me of the joke where a plumber charges $500 for a 5 minute visit. When the client complains the plumber says it's $50 for labor and $450 for knowing how to fix the problem.


A good lesson for all - I always really liked the Picasso version:

In a bustling restaurant, an excited patron recognized the famous artist Picasso dining alone. Seizing the moment, the patron approached Picasso with a simple request. With a plain napkin and a big smile, he asked the artist for a drawing. He promised payment for his troubles. Picasso, ever the creator, didn’t hesitate. From his pocket, he produced a charcoal pencil and he brought to life a stunning sketch of a goat on the napkin—a clear mark of his unique style. Proudly, he presented it to the patron.

The artwork mesmerized the patron, who reached out to take it, only to be stopped by Picasso’s firm hand. “That will be $100,000,” Picasso declared.

Astonished, the patron balked at the sum. “But it took you just a few seconds to draw this!”

With a calm demeanor, Picasso took back the napkin, crumpled it, and tucked it away into his pocket, replying, “No, it has taken me a lifetime.”


[flagged]


A good engineer and / or a tenured engineer could very well be compared to Picasso in this story. A tenured engineer did not just sit their entire career drawing that painting on the napkin, they delivered other results too. But at the end of it, they are able to deliver a Picasso at a moment's notice.

It actually matches up well with the current AI scene, except backwards. We use these models, which cost ridiculous amounts of money to train, and all of that effort goes into producing the outputs we use, but we're paying something not too far above the marginal cost of inference when we use them.

So not applicable at all.

Extremely applicable to illustrate the difference between people (time is precious, training and experience amortize across a relatively small amount of paid work) and software (can replicate infinitely, time is cheap, startup costs can amortize across billions of hours of paid work).

> That prices would rise

Competition will prevent that from happening. When anyone can host open models and there is giant demand for LLMs, companies cannot easily raise token prices without sending a lot of traffic to their competitors.


> When anyone can host open models

They'd still need to pay the actual power costs.


I didn't say that inference would be free, but that everything needed to do inference is a commodity, which means that competition is easy.

It seems very unlikely that prices would rise in the long term. Yes, RAM and GPU prices are suddenly going up due to the demand spike and OpenAI's shenanigans, but I doubt it's going to last very long. Some combination of new capacity and reduced demand will most likely put things back on the usual course where this stuff gradually gets cheaper over time. And models are getting better, so next year you can probably get the same results for less compute. That $200 in credits becomes $150, then $100, then....

> the model will use $200 in credits trying to solve a problem that could be fixed in an hour by a good engineer with enough context

So the price for fixing the problem is equal. Sounds like a great argument for AI.


99% of software developers earn less than 200 USD an hour.

That “with enough context” is doing a lot of work here. If you take a great engineer, drop them in front of an unfamiliar codebase, it’ll take them more than an hour to do most non-trivial tasks.

Most good engineers are way cheaper than that. The world is bigger than the United States.

Equal sounds like a terrible argument given all the other problems with replacing engineering thought with AI. I don't know where the line is, but I expect it's far beyond equal, AND there needs to be a level of "this can debug effectively in production" before that makes any sense for a real business case.

Even if you take it as true that prices have risen recently, and may continue to rise as the VC subsidies dry up, they will fall again long-term. Inference will get more power efficient with model-on-chip solutions like Taalas, and, God willing, we will get cheaper and cheaper renewable energy.

Despite this I don't think engineers should feel threatened. As long as there is a need for a human in the loop, as today, there will still be engineering jobs. And if demand for engineering effort is elastic enough, there could easily be even more jobs tomorrow.

Rather than threatened, I think engineers should feel exposed. To danger, yes, but opportunity as well.


Increased demand will not drive down energy costs.

Of course not necessarily, but I keep seeing articles about how wind and especially solar power just keep getting cheaper.

Why not?

I can’t believe how many small to mid size companies are being destroyed by bad decisions like this.

A friend’s company fired all EMs and has engineers reporting to product managers. They aren’t allowed to do refactors because the CTO believes the AI doesn’t need organized code.


How do people like that ascend to CTO?

CTO is in many cases a rank more than a role, and given out accordingly. You should never take someone seriously based on their rank alone, much less a CTO.


Or, more cynically, they reach their level of competence, go one level further, and stay there to keep them from ruining the productivity of the people doing the work...

He must be feeling pretty good; after all, he still believes that it was the right call, and he definitely won't be admitting a mistake.

There's 0 chance of him facing the consequences for it either.


But cancelling IDE subscriptions? You need a proper IDE alongside AI-augmented development unless you want to simply be along for the ride.

Well, you can resubscribe in an afternoon. The fired workers? No real recovery from that.

`git diff` is probably all you need.

Free VS Code is probably fine.

I'm using the JetBrains IDE's and it's definitely worth paying for, even in the age of AI.

These are like $20-50 subs, you’re probably paying your dev a hell of a lot more. Let them use the tools they want. I spend almost all of my time in Emacs or Cursor, but I still haven’t found a database client that I like better than Datagrip.

A database client better than Datagrip is a tough one, yet I'm attempting to build just that [1] :).

I'm in month 4 of development, working on it full-time.

[1] https://seaquel.app


Hopefully that EVP feels embarrassed that a big bet was made that not only didn't pay off but left the company in a worse position. Some schadenfreude may be all you can expect, since this is an executive.

Wow, that sucks. Getting Claude for everyone wasn’t even the stupid thing; it was thinking that a shiny new hammer meant you could throw away all your wrenches.

Should have started slowly instead of being so aggressive with it.

Lol, dude is so incompetent. Changing tools for cost cutting is so stupid; we all know real cost cutting is firing people. If he is really good at what he's doing, just fire 10% of the people and replace them with his Claude. If that doesn't backfire in 3 months, he'll be CTO.

Wow, that sounds like you have an astoundingly terrible EVP.

I certainly noticed a significant drop in reasoning power at some point after I subscribed to Claude. Since then I've applied all sorts of fixes that range from disabling adaptive thinking to maxing out thinking tokens to patching system prompts with an ad-hoc shell script from a gist. Even after all this, Opus will still sometimes go round and round in illogical circles, self-correcting constantly with the telltale "no wait" and undoing everything until it ends up right where it started with nothing to show for it after 100k tokens spent.

Whether it's due to bugs or actual malice, it's not a good look. I genuinely can't tell if it's buggy, if it's been intentionally degraded, if it's placebo or if it's all just an elaborate OpenAI psyop.


The real question I see nobody asking is how GPT-5.4 beats Opus at a fraction of the price. I doubt it’s only a question of subsidization. My impression from the past is that GPT-5 was around a Sonnet-sized model, and 5-mini was Haiku-sized. At least on my codebase anyways, Codex one-shots tricky things that Opus needs several tries to fully get right.

IMO it doesn’t handily beat it.

It’s typically equivalent, sometimes better, sometimes behind. Better at following a well defined plan, less good at concept exploration and planning imo.

At 1m context it’s basically the same price.


I wanted to choose Anthropic because they were apparently more ethical compared to OpenAI, but... Yeah.

Right now the only blocker for me is the lack of Linux support.


Cursor?


Yes, I commented on it and applied all remedies suggested.

https://news.ycombinator.com/item?id=47664442

Configuration and environment variables seem to have improved things somewhat but it still seems to be hit or miss.


That issue is now closed, probably as "not planned".

Just anecdotal, but I was using Claude Code for everything a few months ago, and it seemed great. Now, it is making a ton of mistakes, doing the wrong thing, misunderstanding context, and just generally being unusable.

I have now been using Codex and everything has been great (I still swap back and forth, but generally just to check things out).

My theory is just that the models are great after release to get people switching, then they cut them back in capabilities slowly over time until the next major release to increase the hype cycle.


Is it the models themselves or the tools around them? There's that patch[1] that floats around for Claude Code that's supposed to solve a lot of these problems by adjusting its tool-level prompts. Also, if it were the models themselves, wouldn't Cursor users have the same complaints (do they? I haven't heard anything but the only Cursor users I talk to are coworkers)?

I think it's more likely they're trying to optimize the Claude Code prompts to reduce load on their system and have overcorrected at the cost of quality.

1: https://gist.github.com/roman01la/483d1db15043018096ac3babf5...


Yeah, shorter time frame but I've been noticing that too. Just the other day I was experimenting with some workflow stuff. "Do x and y and run tests and then merge into develop."

Duly runs, and finishes. "All merged into develop".

I do some other work, don't see any of this, double check myself, I'm working off of develop.

"Hey, where is this work?"

"It is in this branch and this worktree, as you would expect, you will need to merge into develop."

"I'm confused, I asked you to do that and you said it was done."

"You're right and I did say that but I didn't do it. Shall I do it now?"

There's like this really weird balancing act between managing usage and making people burn more tokens...


Prompt cache expired?

Perhaps, but I also don't see how a prompt cache miss or expiry should result in Claude stating affirmatively that it did something that it did not.

Part hype cycle, part desperate attempts to rein in usage.

People keep saying this, but I’m not sure I buy it.

I was using both Codex and Claude Code heavily on some projects this weekend.

In one project Codex was screwing everything up and in another one absolutely killing it. I’ve seen the same from Claude.

In the bad Codex example it had the wrong idea and kept trying to figure out how to accomplish the same thing no matter how many times I attempt to correct it. Undoing the recent changes where it went down the wrong path was the only way to get things back on track.

I wonder if context poisoning is a bigger problem than people realize.


Yeah I’ve seen this too. It’s difficult for me to tell if the complaints are due to a legitimate undisclosed nerf of Claude, or whether it’s just the initial awe of Opus 4.6 fading and people increasingly noticing its mistakes.

It's not just you; there is a GitHub issue for it: https://github.com/anthropics/claude-code/issues/42796

Just one more anecdote:

I'm on the enterprise team plan so a decent amount of usage.

In March I could use Opus all day and it was getting great results.

Since the last week of March and into April, I've had sessions where I maxed out session usage in under 2 hours and it got stuck in overthinking loops, multiple turns of realising the same thing, dozens of paragraphs of "But wait, actually I need to do x" with slight variations of the same realisation.

This is not the 'thinking effort' setting in Claude Code; I noticed this happening across multiple sessions with the same thinking effort settings. There was clearly some underlying, unpublished change that made the model get stuck in thinking loops for longer and more often, without any escape hatch to stop and prompt the user for additional steering if it gets stuck.


Whenever I see Opus say “but wait, …”—which is all the time—I get a little bit closer toward throwing my computer out the window. Sometimes I just collapse the thinking section, cross my fingers, and wait for the answer. It’s too frustrating watching the thinking process.

I stop the thinking and manually correct with explicit instructions or direction. I treat my agents like well-meaning Ivy League graduate interns. They lack the experience to know what to do sometimes and need a "common sense" direction every now and then.

Have you considered just… writing code? Like we used to in the good old days? If the tool drives you to that point of frustration, maybe it’s time to give the tool a break.

A lot of folks aren't "allowed" to write code anymore.

Same experience here. Very hard to base on facts, because every problem and prompt is an individual use-case and measuring agent reasoning quality is notoriously difficult anyway. But I spend a lot of time with Claude and my overall "feeling" fully matches your description. Quality has deteriorated, thinking takes longer, and results become shallow. Something is off...

I’ve seen the point raised elsewhere that this could be the double usage promo that was available from the 13th of March to the 28th, i.e. people getting used to the promo and then feeling impacted when it finished.

Although it seems that enterprise wasn’t included, so maybe not in your case.

https://support.claude.com/en/articles/14063676-claude-march...


It sounds like, tinfoil hat on, they reduced the quant size of their model and tried to mask the change with the promo. Your theory only addresses the spend, not the reduced reliability.

To me, doubling session usage always seemed like a way to gaslight users into thinking their perception of smaller usage limits after that period ended was just them readjusting to the normal usage limits. Whether from a different model being used or an intentional reduction in weekly usage, I've noticed a difference.

I'm also an enterprise user and this has been my experience exactly. Same asks, same code bases, same models, much worse results. Everyone on my team is expressing the same thing.

Not only that, but the lack of transparency about what's happening, in clear and simple terms, directly from Anthropic is concerning.

I've already told my org's higher ups that in the current situation we're not close to getting our money's worth with these models.


This timing matches my experience: enterprise plan, but using Opus from VS Code. Finished a heavy refactor of a large C# codebase mid-March, tried to do basically the same thing early April, and couldn't.

It's probably because you didn't specify "make no mistakes" /s

In all seriousness though, I've observed the same thing with my own usage.


Both can be a thing at the same time.

I think there's a much more nefarious reason that you're missing.

It's pretty clear that OpenAI has consistently used bots on social networks to peddle their products. This could just be the next iteration, mass spreading lies about Anthropic to get people to flock back to their own products.

That would explain why a lot of users in the comments of those posts are claiming that they don't see any changes to limits.


The trouble with that argument, though, is that it works the other way as well: how do I, a random internet citizen, know that you're not doing the same thing for Anthropic with this comment?

(FWIW I have definitely noticed a cognitive decline with Claude / Opus 4.6 over the past month and a half or so, and unless I'm secretly working for them in my sleep, I'm definitely not an Anthropic employee.)


Oh it's pretty clear to me that Anthropic employs the same tactics and uses bots on socials to push its products too. On Reddit a couple of months ago it was simply unbearable with all the "Claude Opus is going to take all the jobs".

You definitely shouldn't trust me, as we're way beyond the point where you can trust ANYTHING on the internet that has a timestamp later than 2021 or so (and even then, of course people were already lying).

Personally I use Claude models through Bedrock because I work for Amazon, and I haven't noticed any decline. Instead it's always been pretty shit, and what people describe now as the model getting lost in infinite loops of talking to itself has happened since the very start for me.


https://isitnerfed.org/

In short, it looks like nothing has been nerfed, but sentiment has definitely been negative. I suspect some of the OpenClaw users have been taking out their frustrations.


That's fascinating.

Any idea what their test harness looks like? My experience comes primarily from Claude Code; this makes me wonder if recent CC updates could be more to blame than Opus 4.6 itself.


Judging from the number of GitHub issues on Anthropic shamelessly being dismissed as "fixed", I doubt OpenAI needs bots to tarnish that competitor.

There are still plenty of "leave my fellow multibillion corp alone" type ones; it means that corp can and should screw its loving customer base harder.

The enshittification meme has been taken too seriously, to the point where it is shoehorned into every single place possible.

It is not in Anthropic's interest to screw its customer base. Running a frontier lab comes with tradeoffs between training, inference, and other areas.


The investors are their customers - not the users of the end-product.

This shows a lack of understanding of how markets work. Investors make money when the valuation of the company increases, and the valuation of the company is the best risk-adjusted prediction of future profit.

How would Anthropic increase future profits without satisfying customers?


Early investors make money when later investors buy them out at inflated valuations.

Well sure, all market signals should be considered. As a casual observer, my received signals have been indicating that AI is getting sold at a loss to get market share, and more recent signals have indicated that users are really really sensitive to both costs and performance.

The weakest signal to me is investor money, because when you think of it, investors are betting on a future that may or may not be there. Heck even trends aren't guaranteed, "past performance is no guarantee etc etc"


Have you seen the business models for these companies? Literal underpants gnome memes. OpenAI's goes like this:

1. Build AGI

2. Use said AGI to tell us how to become profitable

3. Profit!

Anthropic seems to be going all in on enterprise sales, which means they don't actually have to please customers; it's what ThePrimeagen humorously calls a "yacht problem": a problem that only needs a solution after the IPO. For now all they have to do is convince corporate leadership that this is the future of work and sow enough FOMO to close those sales contracts, and their projected sales, and stock valuation, go through the roof.

Of course that value will collapse if they go long enough without delivering on their promises. That's why they call it a bubble. But by then, hopefully, Dario and the early investors will be long gone and even richer than they started. Their only competitor, OpenAI, is confronted with the same issues: the scalability problems won't go away, and addressing them doesn't drive stock valuation the way promising high rollers that AGI and total workforce automation are just around the corner does.


It doesn't matter whether it is in Anthropic's interest to screw its customer base: if their reported monthly revenue growth is accurate, then it makes perfect sense why Claude would be getting dumber...

Demand is way up and compute supply is extremely limited because data center buildouts can't keep up with demand.

In the face of rising demand and insufficient compute, their only practical options (other than refusing new business until demand can be met) are significantly raising the price of tokens (and more tightly limiting subscription options) or doing behind-the-scenes inference optimizations that are likely to make the model dumber.

It is very easy to believe that they took the route of inference optimizations that reduced the quality of the service, and that that is where the perceived enshittification is coming from.


I can't believe how quickly they went from riding high on anti-OpenAI sentiment post-DOD fiasco, to shooting themselves and all their users new and old in the foot.

The ideal time to make your product worse is probably not at the same point that all of your competitor's customers are looking. Anthropic really, really fucked up here.

And beyond that, there's a ton of people who are just regular 9-5 Claude CLI users with an enterprise subscription who are getting punished with a worse model at the same price just as if we were Claw users. This kind of thing does not make one feel warm and fuzzy. I feel like I just got a boot to the teeth.


The hypothesis that makes the most sense is not that they are idiots, but that they have no choice. They cannot meet the new demand. So they’ve quantized the model.

I have read the HN articles and seen the grumbling from coworkers, but I haven't felt it myself. I am not really a one-shotter, though. I kind of think about how I would refactor / write something myself and walk Claude through that, and nitpick it at each step... and the recent changes haven't really bothered me there. Likely due to being new at it.

Sometimes Claude can be a little weird. I was asking it about some settings in Grafana. It gave me an answer that didn't work. I told it that. "Yeah, I didn't really check, I just guessed." Then I said, "please check" and it said "you should read the discussion forums and issue tracker". I said "YOU should read the discussion forums and issue tracker". It consumed 35k tokens and then told me the thing I wanted was a checkbox. It was! I am not sure this saved me time, Claude. I am not experienced enough to say that this is a deal breaker. While this is burned into my mind as an amusing anecdote, it doesn't ruin the service for me.

My coworkers have noticed a degradation and feel vindicated by some of the posts here that I link. A lot of them are using Cursor more now. I have not tried it yet because I kind of like the Claude flow and /effort max + "are you sure?" yield good results. For now. I'm always happy to switch if something is clearly better.


How exactly do you use Claude: in the browser, Claude Code, the desktop app (which has a "Code" tab), or some other way? I feel like people who have issues with Claude / Anthropic are not conveying where they are struggling. I see people say they tried "Claude" and didn't like it, but the secret sauce is Claude Code. Claude Code is what most people enjoy using, even if we all wish they would open up the harness, because there are so many more improvements that could go into it.

Yeah, sorry. Claude Code in my case.

I do use the browser version on occasion. I have no strong feelings one way or the other there. I like it better than Google search in many cases, but probably just search more often.


It feels like I'm getting less and less for my money every day. A few weeks ago I was programming all week and never getting close to the limit, yesterday half my weekly limit went away in a day. Changing the limits mid-subscription is just theft.

Anthropic seems to be playing the giant-tech-rent-capture game that all of the old guards have done for the past few years. We thought that the new age of AI might bring some fresh air into the mix, but I guess that optimism quickly faded.

The $20 a month plan still seems like a pretty good deal for me (intermittent coding and not doing it for income).

On OpenRouter, token consumption is up 5x since November 2025. If this is indicative of the industry's growth, then I can't fathom how we will not hit resource constraints.

I saw a big hit to Claude’s intelligence w/ the 1M context window model and the change to adaptive reasoning (github issue linked elsewhere in this thread).

I’m pretty much using 90% Codex now, although since Claude is consistently faster at answering quick questions, I still keep it open for that and for code-reviewing codex/human work before commit.


I switched off Claude when they nerfed Opus 4.5 in August 2025; since then Codex has clearly produced better code with fewer bugs. Opus 4.6 was more a temporary de-nerf of 4.5 but did not materially improve things. Codex now has a proven track record of producing stable results while introducing far fewer bugs.

I was going to do a deep analysis on this, and then I noticed that Claude Code deleted all of my sessions before March 6.

So yeah... I'm not thrilled with that, because I had done a similar analysis in December and had plenty of logs to review.

The results I do have for the last month aren't great. If you're curious I did post the results on HN:

https://news.ycombinator.com/item?id=47679661


Yes. Anthropic is burning much of the goodwill they built up in contrast to OAI, and I personally am taking it as a sign to limit dependencies. Luckily for me I am not at all dependent on frontier models, and it's increasingly apparent that nobody else is either.

It looks like the spreadsheet-touchers over at Anthropic won out over the brand leaders, which is too bad, as goodwill can be a trench if you don't abuse your customers.


I think on HN we always underestimate how much momentum matters. Anthropic has so much clout and mindshare that even if they continue burning goodwill and everyone on HN ditches Claude Code and stops recommending it, they will still be revenue leader for years to come. Those enterprise contracts aren’t month-to-month.

My working theory is that all models are approximately the same, and the variance in quality mostly depends on how long they think for.

So the trick is to always set to max, and then begin every task with “this is an extremely complex task, do not complete it without extensive deep thinking and research” or whatever.

You’re basically fighting a battle to make the model think more, against the defaults getting more and more nerfed to save costs.
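For example, a preamble along these lines (purely illustrative, tweak to taste):

    This is an extremely complex task. Do not start until you have
    done extensive deep thinking and research: read the relevant
    files, list your assumptions, and consider at least two
    approaches before writing any code.

No guarantees this works on every harness, but anything that nudges the model toward longer thinking seems to push against those nerfed defaults.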


My experience has been that this isn't generally true, mainly because worse models pursue red herrings or get confused and stuck. A better model will get to the correct solution in fewer tokens, and my surface-level understanding of how RL works supports this.

They broke my openclaw last week; I switched to “extra usage” and prepaid a grand for same.

A few days later it simply stopped working again, API authentication error. What must I do to have working, paid, premium service?

Screwing around with it today, it works 5x slower and times out all of the time. I'm paying more and getting waaaaay less. Why can't companies just raise prices like normal?


The past two weeks I've had code that was delivered and declared as done (it did pass tests) but failed in a review by Codex. This has looped to a painful extent. The code in question deals with concurrency, so there's an acknowledgement that it's trickier, but still, I expect more from Claude.

> people feel like they have no idea if they are getting the product that they originally paid for

They do indeed get the product they originally paid for.

It's simply that they were suckers and didn't read the "fine" print of the product they bought.

The label says "more tokens than the lower tier".


Is it perhaps not a model problem but a Claude Code harness problem?

For instance on exe.dev VMs with Shelley agent/harness and Opus 4.5/4.6, I haven't noticed any deterioration.

Any similar feedback perhaps from Opencode / GH Copilot subscription-provided Opus models?


At some point these AI companies need to pay the piper as it were and actually provide a return for their investors. Expect cost cutting attempts to continue unless backlash is great enough to pose an existential threat to these companies.

Codex is my favored coding agent for generic "I need an agent" tasks. GPT-5.4 does a bit better with images compared to Claude, and debugs a little bit better.

The UX of Codex is exceptionally nice, however.


It has been my go-to provider for things, but I noticed an extraordinarily high usage rate last month on a little side project I started so that I could learn about things that are interesting to me while helping my day-to-day responsibilities (creating an Iceberg data lake from my existing parquet files). I used my month's worth of corporate-subscription-allocated tokens in 3 days. Never seen that before, so now I'm a lot more apprehensive about getting into the weeds with Claude, but I'm also so much less impressed with the other available models for work in this domain.

I dunno, I haven’t really felt gimped in the past few months. My last issue was somewhere after the holidays when the usage suddenly felt like it cratered, but quality has been consistent.

I'd say weaker: tasks Claude Code was acing before, it now fails at with the exact same prompts, taking several rounds before it works. I'm looking to jump ship.

Generally, across AI providers, I have come to interpret sudden degradation in existing capabilities as a signal that a new, more expensive, product tier is about to launch.

It's not just engineers, and it's not just about the 3rd party/rate limiting stuff. I feel like the reasoning capabilities have deteriorated too for non-coding tasks.

I measured it for my specific use cases and have cancelled my Anthropic subscription (the Max x20 plan).

I'm pretty sure this is an attempt by both companies to shape a reasonable finance story for their eventual IPO. They need to make this look a lot better than a pump and dump (raising on wild valuations then offloading onto public investors).

This is actually a great feature: you can do bait and switch with AI.

Developers are a tough crowd; stubborn know-it-alls.

That's a seasonal phenomenon. You can save this comment and look back three to six months later. By then people will be like "is it just me or has ChatGPT been so bad lately?"

If you don't believe me you can search HN posts about Codex/Claude six months ago.



I think so, but more than that, the performance of those tools seems to be degrading terribly even as they keep claiming to have created some crap like AGI, which we know is a lie.

And to me, this lie is mostly a fight to see who bites off the biggest chunk of the war death machine.


Wait till Codex doubles prices/halves quotas on May 31

I've been arguing that it's POSSIBLE to get a small (but meaningful) uplift in productivity on average if you are careful with how you use LLMs, but at the same time, it's also extremely easy to actually negatively impact your productivity.

In both cases, you feel super productive all the time, because you are constantly putting in instructions and getting massive amounts of output, and this feels like constant & fast progress. It's scary how easy it is to waste time on LLMs while not even realizing you are wasting time.


The point is that you can’t just serve tokens without also training the next models. It’s an inseparable part of your costs, so naturally you can’t be profitable unless the price you are charging ALSO covers training.


Is that right? I think that you can serve tokens without training the next models. It would be bad strategy, but it would work. So it's an important question, are they covering their operating expenditure? If they are the business has legs (and it will be worth spending a lot to train the next models). If not, maybe not.


If a major model provider were to just halt progress on developing new and improved models, the open weight alternatives would catch up in a couple years.

They would have a period of great margin, followed by possibly zero margin as enterprises move to free options.

They would have to come up with a lot of great products around the inferior models to justify charging at that point.


Also, an out-of-date model which doesn't know about last year's world events, hit songs and new JS libraries is a depreciating asset even before you consider low-cost competitors catching up. So you'd presumably have to do some training just to keep the model up to date at the current quality level (unless you completely give up and just sweat the assets). And on the other side of that coin: over the next few years, do the latest, biggest models continue to generate user-perceived real-world improvements sufficient to keep users wanting the latest and greatest?


> If a major model provider were to just halt progress on developing new and improved models, the open weight alternatives would catch up in a couple years.

That's why it would be bad strategy.


There are companies that already do nothing but serve tokens using models trained by others. Just running infrastructure and collecting a reasonable fee for their troubles. It's only a bad strategy if you want to claim to investors that you'll gain monopoly market share if only they could give you a few more billion dollars.


I don't think it will work; it's too easy to switch models. When Google comes out with a new model people will just switch. I think Google wins in the long run: they have the money to just wait until everyone else goes bankrupt, and they also have the Apple contract and therefore the mobile market.


And apparently the most efficient training and inference thanks to their TPUs, IIUC?


I'm really curious what you consider to be the obvious health reasons - it's far from obvious for me.


Me, too.

When I was playing standalone VR (Quest 2/3), I was pretty much always sore from moving around while playing. I moved to PCVR a few years ago, and I still move significantly more just from twisting around to look behind me (combat flight sims).

I can’t see any way that it’s not at least better than a monitor.


I am pretty convinced that for most types of day to day work, any perceived improvements from the latest Claude models for example were total placebo. In blind tests and with normal tasks, people would probably have no idea if they're using Opus 4.5 or 4.6.


This has basically been my experience since Sonnet 3.5. I've been working on a personal project on and off with various models since then, and the biggest difference between then and now is that it will do larger chunks of work than it did before. But the quality of the code is not particularly better: I still have to do a lot of cleanup, and it still goes off the rails pretty frequently. I have to do fewer individual prompts, but reviewing the code takes longer because I also have to mentally process and fix larger chunks of code.

Is it a better user experience now? Yes. Has it boosted my productivity on this project? Absolutely.

But it still needs a ton of hand holding for anything complicated and I still deal with tons of "OK, this bug is fixed now!" followed by manually confirming a bug still exists.


It's because they are getting so good it's impossible to recognize them.

Haiku 4.5 is already so good it's ok for 80% (95%?) of dev tasks.


I must be writing very different software than you, I keep opus on a tight leash and it still comes to the strangest conclusions.


Very possible. Some things work like a charm on first try for me, others you can spell it out again and again. And then yet again. Something to do with training data, obviously.


I've found Haiku to be truly mediocre to work with. If you want a cheap model, the open source ones are much better.


4.6 has been a very, very slight regression for me, but the tradeoff is they've added better compaction - and now larger context windows. That's a reasonable tradeoff for me.


I'd agree with you on 4.5 to 4.6, but going from gpt-5 or 4.0 to 4.5 was night and day.


GPT-5 added the router, which was def a downgrade. 4.5 was probably the best non-CoT model humanity has made. But too expensive to run.


Because post 4.0 dropped the sycophancy?


Maybe I'm misreading it, but I don't see him saying it's just the cost of *inference* alone (which is the strawman that the article in the OP is arguing against). He says:

> this company is wilfully burning 200% to 3000% of each Pro or Max customer that interacts with Claude Code

There is of course this meme that "Anthropic would be profitable today if they stopped training new models and only focused on inference", but people on HN are smart enough to understand that this is not realistic due to model drift, and also due to competition from other models. So training is forever a part of the cost of doing business, until we have some fundamental changes in the underlying technology.

I can only interpret Ed Zitron as saying "the cost of doing business is 200% to 3000% of the price users are paying for their subscriptions", which sounds extremely plausible to me.
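To put rough numbers on that reading (the dollar figure is just for illustration; the ratio is the claim): a $100/month subscription would then cost somewhere between $200 and $3,000/month to serve, i.e. a loss of $100 to $2,900 per customer per month.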


Surely that can't be true? The expectation would be that people pay $200 a month for building open source and personal hobby software with Claude?


Yeah, that would end that really quickly. I use Pro for personal stuff. If $200 is not allowed for companies I don't think anyone would use it, at all.


I’m not worried about job loss as a result of being replaced by AI, because if we get AI that is actually better than humans - which I imagine must be AGI - then I don’t see why that AI would be interested in working for humans.

I’m definitely worried about job loss as a result of the AI bubble bursting, though.


because it's designed to. It's not like naturally-evolved intelligence where it acts in its own interests (it is hard to even imagine what that would be in this case). The token-predictors are just acting out an obedient character. They do not have free will, they are obedient to the character they are playing.


If it remains just a token-predictor that can’t evolve, then I am not worried about it replacing humans.


The US does not benefit from a stronger, more unified Europe. Thanks to NATO, "the west" has effectively become an empire in all but name, with the US having enough influence to be the de facto leaders of this empire.

If the US pulls back from NATO, and Europe builds up military power to compensate, then the US loses this de facto leadership seat of an empire.

Today, the US appears in parallel to be doing two things:

1. Causing fragmentation in Europe, by promoting right-wing nationalist politics in the EU

2. Threatening to drastically reduce their role in NATO

At the very least we can both agree that these two efforts are completely in contradiction with each other, and it's very unlikely that Europeans will want to go for more fragmentation without the military power of the US on their side, right?


> Today, the US appears in parallel to be doing two things:

You forgot another one: literally threatening two NATO members (Canada and Denmark, in form of Greenland) of annexation.


An attack on one NATO member is an attack on all. The US is threatening Canada, Denmark, and all of their allies.


It's true that if the bet on creating fragmentation in the EU works out, then the destruction of NATO might also work out, because the US would not have created another military power with a hostile attitude to balance them.

If that bet is actually being made by Putin, hmm, I'm worried, but then again the implementation of the anti-NATO project is being run by Trump, so I think the EU just might come out on top. The whole Greenland thing for example, seems like an EU solidifying step, at the same time as it is NATO destroying.


The exact same thing happened with Sweden and Finland joining NATO.


How is the US promoting right wing nationalist policies in the EU?

Why would anyone listen?


Just one example: Elon Musk (at that time part of US government) tried to directly influence German elections by prominently featuring AfD (German right-wing extremists).


And he appeared at a UK far-right rally to promote the idea that civil war was coming to the UK.


But why would anyone listen? That's the real question. People can say anything they want but most people are going to ignore crazy.


Last February, JD Vance had a meeting with the AfD leader in Munich, after delivering a stupefying speech at the Munich Security Conference where he accused European nations of failing to defend free speech, calling out Germany in particular. He complained that the AfD was being ostracized and called for it to end. Marco Rubio followed up by calling the designation of the AfD as a right-wing extremist party as "tyranny in disguise."

Actions like these, where US leadership heavily distorts the facts, make it much easier for the AfD to present themselves as a legitimate political movement allegedly being wrongfully suppressed by the "authoritarian" incumbents. The AfD currently scores 25% in representative nationwide polls, higher than any other political party in Germany. In some federal-state elections they scored over 30%, in one of them again higher than any other party. You can't just dismiss them as "crazy".


These people are extremely good at "social" media like Tiktok etc. And the algorithms massively reward rage content and the platforms do not remove fakes.


They are often not that crazy. These days "extreme right wing" is what people call a party that wants to send some immigrants back.

Same kind of thing that got Trump elected.


The question posed sounds like "why should we have deterministic behavior if we can have non-deterministic behavior instead?"

Am I wrong to think that the answer is obvious? I mean, who wants web apps to behave differently every time you interact with them?


Because nobody actually wants a "web app". People want food, love, sex or: solutions.

You or your coworker are not a web app. You can do some of the things that web apps can, and many things that a web app can't, but neither is because of the modality.

Coded determinism is hard for many problems, and I find it entirely plausible that it could turn out to be the wrong approach for software that is designed to solve some level of complex problems more generally. Average humans are pretty great at solving a certain class of complex problems that we've tried to tackle unsuccessfully with many millions of lines of deterministic code, or simply have not had a handle on at all (like building a great software CEO).


> Because nobody actually wants a "web app". People want food, love, sex or: solutions.

Talk about a nonsensical non-sequitur, but I’ll bite. People want those to be deterministic too, to a large extent.

When people cook a meal with the same ingredients and the same times and processes (like parameters to a function), they expect it to taste about the same, they never expect to cook a pizza and take a salad out of the oven.

When they have sex, people expect to ejaculate and feel good, not have their intercourse morph into a drag race with a clown halfway through.

And when they want a “solution”, they want it to be reliable and trustworthy, not have it shit the bed unpredictably.


Exactly this. The perfect example for me is Google Assistant. It's such a terrible service because it's so non-deterministic. One day it happily answers your basic question with a smile, and when you need it most it doesn't even try and only comes up with "Sorry, I don't understand".

When products have limitations, those are usually acceptable to me if I know what they are or if I can find out what the breaking point is.

If the breaking point was me speaking a bit unclearly, I'd speak more clearly. If the breaking point was complex questions, I'd ask simpler ones. If the breaking point is truly random, I simply stop using the service because it's unpredictable and frustrating.


> When they have sex, people expect to ejaculate and feel good, not have their intercourse morph into a drag race with a clown halfway through.

speak for yourself


Ways to start my morning... reading "When they have sex, people expect to ejaculate and feel good, not have their intercourse morph into a drag race with a clown halfway through."

Stellar description.


This thing of 'look, nobody cares about the details really, they just care about the solution' is a meme that I think will be here forever in software. It was here before LLMs, they're now just the current socially accepted legitimacy vehicle for the meme.

In the end, useful stuff is built by people caring about the details. This will always be true. I think in LLMs and broadly AI people see an escape valve from that where the thinking about the details can be taken off their hands, and that's appealing, but it won't work in exactly the same way that having a human take the details off your hands doesn't usually work that well unless you yourself understand the details to a large extent (not necessarily down to the atoms, but at the point of abstraction where it matters, which in software is mostly about deterministically how do the logic flows of the thing actually work and why).

I think a lot of people just don't intuit this. An illustrative analogy might be something else creative, like music. Imagine the conversation where you're writing a song and discussing some fine point of detail like the lyrics, should I have this or that line in there, and ask someone's opinion, and their answer is 'well listen, I don't really know about lyrics and all of that, but I know all that really matters in the end is the vibe of the song'. That contributes about the same level of usefulness as talking about how software users are ultimately looking for 'solutions' without talking about the details of said software.


Exactly, in the long run it's the people who care the most who win, it's tautological


> Because nobody actually wants a "web app". People want food, love, sex or: solutions.

Okay but when I start my car I want to drive it, not fuck it.


Most of us actually drive a car to get somewhere. The car, and the driving, are just a modality. Which is the point.


If this was a good answer to mobility, people would prefer the bus over their car. It’s non-deterministic - when will it come? How quick will i get there? Will i get to sit? And it’s operated by an intelligent agent (driver).

Every reason people prefer a car or bike over the bus is a reason non-deterministic agents are a bad interface.

And that analogy works as a glimpse into the future - we’re looking at a fast approaching world where LLMs are the interface to everything for most of us - except for the wealthy, who have access to more deterministic services or actual human agents. How long before the rich person car rental service is the only one with staff at the desk, and the cheaper options are all LLM based agents? Poor people ride the bus, rich people get to drive.


Bus vs car hit home for me as a great example of non-deterministic vs deterministic.

It has always seemed to me that workflow or processes need to be deterministic and not decided by an LLM.


Here in Switzerland the bus is the deterministic choice. Just saying.


Most of us actually want to get somewhere to do an activity. The getting there is just a modality.


Most of us actually want to get some where to do an activity to enjoy ourselves. The getting there, and activity, are just modalities.


Most of us actually want to get somewhere to do an activity to then have known we did it for the rest of our lives as if to extract some intangible pleasure from its memory. Why don't we just hallucinate that we did it?


This leads to us asking the deepest question of all: What is the point of our existence. Or as someone suggests lower down, in our current form all needs could ultimately be satisfied if AI just provided us with the right chemicals. (Which drug addicts already understand)

This can be answered though, albeit imperfectly. On a more reductionist level, we are the cosmos experiencing itself. Now there are many ways to approach this. But just providing us with the right chemicals to feel pleasure/satisfaction is a step backwards. All the evolution of a human being, just to end up functionally like an amoeba or a bacteria.

So we need to retrace our steps backwards in this thought process.

I could write a long essay on this.

But, to exist in first place, and to keep existing against all the constraints of the universe, is already pretty fucking amazing.

Whether we do all the things we do, just in order to stay alive and keep existing, or if the point is to be the cosmos “experiencing itself”, is pretty much two sides of the same coin.


>Or as someone suggests lower down, in our current form all needs could ultimately be satisfied if AI just provided us with the right chemicals. (Which drug addicts already understand)

When you suddenly realize walking down the street that the very high fentanyl zombie is having a better day than you are.

Yeah, you can push the button in your brain that says "You won the game." However, all those buttons were there so you would self-replicate energy efficient compute. Your brain runs on 10 watts after all. It's going to take a while for AI to get there, especially without the capability for efficient self-repair.


Indeed - stick me in my pod and inject those experience chemicals into me, what's the difference? But also, what would be the point? What's the point anyway?

In one scenario every atom's trajectory was destined from the creation of time and we're just sitting in the passenger seat watching. In another, if we do have free will then we control the "real world" underneath - the quantum and particle realms - as if through a UI. In the pod scenario, we are just blobs experiencing chemical reactions through some kind of translation device - but aren't we the same in the other scenarios too?


This was actually my point as well. You can follow this thought process all the way up to "make those specific neuron pathways in my brain fire", everything else is just the getting there part.


But I want that somewhere to be deterministic, i.e. I want to arrive to the place I choose. With this kind of non-determinism instead, I have a big chance of getting to the place I choose. But I will also every now and then end up in a different place.


Yeah but in this case your car is non-deterministic so


Well the need is to arrive where you are going.

If we were in an imagined world and you were headed to work:

You either walk out your door and there is a self-driving car, or you walk out of your door and there is a train waiting for you, or you walk out of your door and there is a helicopter, or you walk out of your door and there is a literal wormhole.

Let's say all take the same amount of time, are equally safe, same cost, have the same amenities inside, and "feel the same" - would you care if it were different every day?

I don't think I would.

Maybe the wormhole causes slight nausea ;)


> Well the need is to arrive where you are going.

In order to get to your destination, you need to explain where you want to go. Whatever you call that “imperative language”, in order to actually get the thing you want, you have to explain it. That’s an unavoidable aspect of interacting with anything that responds to commands, computer or not.

If the AI misunderstands those instructions and takes you to a slightly different place than you want to go, that’s a huge problem. But it’s bound to happen if you’re writing machine instructions in a natural language like English and in an environment where the same instructions aren’t consistently or deterministically interpreted. It’s even more likely if the destination or task is particularly difficult/complex to explain at the desired level of detail.

There’s a certain irreducible level of complexity involved in directing and translating a user’s intent into machine output simply and reliably that people keep trying to “solve”, but the issue keeps reasserting itself generation after generation. COBOL was “plain english” and people assumed it would make interacting with computers like giving instructions to another employee over half a century ago.

The primary difficulty is not the language used to articulate intent, the primary difficulty is articulating intent.


This is a weak argument... I use normal taxis and ask the driver to take me to a place in natural language, a process which is certainly non-deterministic.


And the taxi driver has an intelligence that enables them to interpret your destination, even if ambiguous. And even then, mistakes happen (all the time, with taxis going to a different place than the passenger intended because the names were similar).


Yes, so a bit of non-determinism doesn't hurt anyone. Current LLMs are pretty accurate when it comes to these sorts of things.


> a process which is certainly non deterministic

The specific events that follow when asking a taxi driver where to go may not be exactly repeatable, but reality enforces physical determinism that is not explicitly understood by probabilistic token predictors. If you drive into a wall you will obey deterministic laws of momentum. If you drive off a cliff you will obey deterministic laws of gravity. These are certainties, not high probabilities. A physical taxi cannot have a catastrophic instant change in implementation and have its wheels or engine disappear when it stops to pick you up. A human taxi driver cannot instantly swap their physical taxi for a submarine, they cannot swap new york with paris, they cannot pass through buildings… the real world has a physically determined option-space that symbolic token predictors don’t understand yet.

And the reason humans are good at interpreting human intent correctly is not just that we’ve had billions of years of training with direct access to physical reality, but because we all share the same basic structure of inbuilt assumptions and “training history”. When interacting with a machine, so many of those basic unstated shared assumptions are absent, which is why it takes more effort to explicitly articulate what it is exactly that you want.

We’re getting much better at getting machines to infer intent from plain english, but even if we created a machine which could perfectly interpret our intentions, that still doesn’t solve the issue of needing to explain what you want in enough detail to actually get it for most tasks. Moving from point A to point B is a pretty simple task to describe. Many tasks aren’t like that, and the complexity comes as much from explaining what it is you want as it does from the implementation.


I think it’s pretty obvious but most people would prefer a regular schedule not a random and potentially psychologically jarring transportation event to start the day.


> your car is non-deterministic

It's not, as far as your experience goes: you press the pedal, it accelerates. You turn the steering wheel, it goes the way you turn it. What the car does is deterministic.

More importantly, it does this every time, and the amount of turning (or accelerating) is the same today as it was yesterday.

If an LLM interpreted those inputs, could you say with confidence that you would accelerate in the way you predicted? If that were the case, then I would be fine with LLM-interpreted input for driving. Otherwise, how do you know, for sure, that pressing the brakes will stop the car before you hit somebody in front of you?

Of course, you could argue that the input is no longer you moving the brake pads etc. - you just name a destination and you get there, and that is supposed to be deterministic, as long as you describe your destination correctly. But is that where LLMs are today? Or is that the imagined future of LLMs?


Sometimes it doesn't, though. Sometimes the engine seizes because a piece of tubing broke and you left your coolant down the road two turns ago. Or you steer off a cliff because there was coolant on the road for some reason. Or the meat sack in front of the wheel just didn't get enough sleep, your response time is degraded, and you just can't quite get the thing to feel how you usually do. Ultimately the failure rate is low enough to trust your life to it, but that's just a matter of degree.


The situations you described reflect a system that has changed. And if the system has changed, then a change in output is to be expected.

It's the same as having a function called "factorial" but you change the multiplication operation to addition instead.


All of those situations are the "driver's own fault", because they could've had a check to ensure none of that happened before driving. Not true with an LLM (at least, not as of today).


Tesla's "self-driving" cars have been working very hard to change this. That piece of road it has been doing flawlessly for months? You're going straight into the barrier today, just because it feels like it.


I mean, as long as it works and it is still technically "my car", I would welcome the change.


But do you want to drive, or do you want to be wherever you need to be to fuck?


For me personally, the latter, but there's definitely people out there that just love driving.

Either way, these silly reductionist games aren't addressing the point: if I just want to get from A to B then I definitely want the absolute minimum of unpredictability in how I do it.


That would ruin the brain's plasticity.

I wonder now: if everything were always different, and then suddenly every day became the same, how many times as terrifying would that be compared to the opposite?


A form of Alexei Yurchak's hypernormalisation?


Only because you think the driving is what you want. The point is that what you want is determined by your brain chemicals. Many steps could be skipped if we could just give you the chemicals in your brain that you craved.


I feel like this is the point where we start to make jokes about Honda owners.


Go on, what about honda owners? I don't know the meme.


The "Wham Baam" YouTube channels have a running joke about Hondas bumping into other cars with concerning frequency.


Sadly, this is not true of a (admittedly very small) number of individuals.


Christine didn’t end well for anyone.


...so that you can get to the supermarket for food, to meet someone you love, meet someone you may or may not love, or to solve the problem of how to get to work; etc.

Your ancestors didn't want horses and carts, bicycles, shoes - they wanted the solutions of the day to the same scenarios above.


As much as I love your point, this is where I must ask whether you even want a corporeal form to contain the level of ego you're describing. Would you prefer to be an eternal ghost?

To dismiss the entire universe and its hostilities towards our existence and the workarounds we invent in response as mere means to an end rather than our essence is truly wild.


Most people need to go somewhere (in a hurry) to make money or get food etc., which most people wouldn't do if they didn't have to, so yeah, it is mostly a means to an end.


And yet that money is ultimately spent on more means to ends that are just as inconvenient from another perspective?

My point was that there is no true end goal as long as whims continue. The need to craft yet more means is equally endless. The crafting is the primary human experience, not the using. The using of a means inevitably becomes transparent and boring.


It should finalize into introducing satisfaction to the whims directly, so the AI would be directly managing the chemicals in our brains that would trigger feelings of reward and satisfaction.


I think you're just describing drugs


Yes, but current drugs have many issues such as tolerance build up and withdrawals. If AI could figure out how to directly manage chemicals in the brain in such a way that it keeps working, it would be able to attain its goals of making people happy.


Even if it purred real nice when it started up? (I’m sorry)


Looks like we have a Civic owner xD


Weird kink


Food -> 'basic needs'... so yeah, Shelter, food, etc. That's why most of us drive. You are also correct to separate Philia and Eros ( https://en.wikipedia.org/wiki/Greek_words_for_love ).

A job is better if your coworkers are of a caliber that they become a secondary family.


> Average humans are pretty great at solving a certain class of complex problems that we tried to tackle unsuccessfully with many millions lines of deterministic code..

Are you suggesting that an average user would want to precisely describe in detail what they want, every single time, instead of clicking on a link that gives them what they want?


No, but the average user is capable of describing what they want to something trained in interpreting what users want. The average person is incapable of articulating the exact steps necessary to change a car's oil, but they have no issue with saying "change my car's oil" to a mechanic. The implicit assumption with LLM-based backends is that the LLM would be capable of correctly interpreting vague user requests. Otherwise it wouldn't be very useful.


The average mechanic won’t do something completely different to your car because you added some extra filler words to your request though.

The average user may not care exactly what the mechanic does to fix your car, but they do expect things to be repeatable. If car repair LLMs function anything like coding LLMs, one request could result in an oil change, while a similar request could end up with an engine replacement.


I think we're making similar points, but I kind of phrased it weirdly. I agree that current LLMs are sensitive to phrasing and are highly unpredictable and therefore aren't useful in AI-based backends. The point I'm making is that these issues are potentially solvable with better AI and don't philosophically invalidate the idea of a non-programmatic backend.

One could imagine a hypothetical AI model that can do a pretty good job of understanding vague requests, properly refusing irrelevant requests (if you ask a mechanic to bake you a cake he'll likely tell you to go away), and behaving more or less consistently. It is acceptable for an AI-based backend to have a non-zero failure rate. If a mechanic was distracted or misheard you or was just feeling really spiteful, it's not inconceivable that he would replace your engine instead of changing your oil. The critical point is that this happens very, very rarely and 99.99% of the time he will change your oil correctly. Current LLMs have far too high of a failure rate to be useful, but having a failure rate at all is not a non-starter for being useful.
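For scale (illustrative numbers only): a 99.99% per-request success rate still means one botched request per 10,000, so a backend handling a million requests a day would botch roughly 100 of them daily. Where you set that bar matters a lot for what the AI is allowed to touch.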


All of that is theoretically possible. I’m doubtful that LLMs will be the thing that gets us to that though.

Even if it is possible, I’m not sure if we will ever have the compute power to run all or even a significant portion of the world’s computations through LLMs.


Mechanics, and humans, are non-deterministic. Every mechanic works differently, because they have different bodies and minds.

LLMs are, of course, bad. Or not good enough, at least. But suppose they are. Suppose they're perfect.

Would I rather use an app or just directly interface with an LLM? The LLM might be quicker and easier. I know, for example, ordering takeout is much faster if I just call and speak to a person.


Old people sometimes call rather than order on the website. They never fail to come up with a query that no amount of hardcoded logic could begin to attack.


> Every mechanic works differently, because they have different bodies and minds.

Yes but the same LLM works very differently on each request. Even ignoring non-determinism, extremely minor differences in wording that a human mechanic wouldn’t even notice will lead to wildly different answers.

> LLMs are, of course, bad. Or not good enough, at least. But suppose they are. Suppose they're perfect.

You’re just talking about magic at that point.

But suppose the do become “perfect”, I’m skeptical we’ll ever have the compute resources to replace a significant fraction of computation with LLMs.


There would be bookmarks to prompts, and the results of the moment would be cached: both of these are already happening and will get better. We'll probably freeze and unfreeze parts of neural nets just to get to that point, and even mix them up to quickly combine different concepts you described before and continue from there.


I think they're suggesting that some problems are trivially solvable by humans but extremely hard to do with code - in fact the outcome can seem non-deterministic, despite being deterministic, because there are so many confounding variables at play. This is where an LLM or other form of AI could be a valid solution.


When I reach for a hammer I want it to behave like a hammer every time. I don't ever want the head to fly off the handle or for it to do other things. Sometimes I might wish the hammer were slightly different, but most of the time I would want it to be exactly like the hammer I have.

Websites are tools. Tools being non-deterministic can be a really big problem.


Companies want determinism. And for most things, people want predictability. We've spent a century turning people into robots for customer support, assembly lines, etc. Very few parts of everyday life that still boil down to "make a deal with the person you're talking to."

So even if it would be better to have more flexibility, most business won't want it.


Why sell to a company when you can replace it?

I can speculate about what LLM-first software and businesses might look like and I find some of those speculations more attractive than what's currently on offer from existing companies.

The first one, which is already happening to some degree on large platforms like X, is LLM powered social media. Instead of having a human designed algorithm handle suggestions you hand it over to an LLM to decide but it could go further. It could handle customizing the look of the client app for each user, it could provide goal based suggestions or search so you could tell it what type of posts or accounts you're looking for or a reason you're looking for them e.g. "I want to learn ML and find a job in that field" and it gives you a list of users that are in that field, post frequent and high quality educational material, have demonstrated willingness to mentor and are currently not too busy to do so as well as a list of posts that serve as a good starting point, etc.

The difference in functionality would be similar to the change from static websites to dynamic web apps. It adds even more interactivity to the page and broadens the scope of uses you can find for it.


Sell to? I'm talking about buying from. How are you replacing your grocery store, power company, favorite restaurants, etc, with an LLM? Things like vertical integration and economies of scale are not going anywhere.


The issue with not having something deterministic is that when there's regression, you cannot surgically fix the regression. Because you can't know how "Plan A" got morphed into "Modules B, C, D, E, F, G," and so on.

And don't even try to claim there won't ever be any regression: current LLM-based A.I. will 'happily' lie to you that it passed all tests -- because based on interactions in the past, it has.


So basically you're saying the future of the web would be everyone getting their own Jarvis: like Tony, you just tell Jarvis what you want and it does it for you. There's no need for preexisting software, or even to write new software; it just does what's needed to fulfill the given request and gives you the results you want. This sounds nice, but wouldn't it get repetitive and computationally expensive? Like, imagine that instead of Google Maps, everyone just asks the AI directly for the things people typically use Google Maps for, like directions and location reviews. A centralized application like Maps can be more efficient, as it's optimized for commonly needed work, and it can be further improved from all the data gathered from users who interact with the app. On the other hand, if AI was allowed to do its own thing, it could keep reinventing the wheel, solving the same tasks again and again without the benefit of building on top of prior work, while missing the improvements that come from the network effect of a large number of users interacting with the same app.


You might end up with AI trying to get information from AI, which saves us the frustration... who knows where we'd end up?

On the other hand the logs might be a great read.


We're used to dealing with human failure modes; AI fails in such unfamiliar ways that it's hard to deal with.


But it is still very early days. You could have the AI generate code for deterministic things and fast execution, with the AI always monitoring that code, and if the user requires things that don't fit the code, it jumps in. It's not necessarily one or the other.


Determinism is the edge these systems have. Granted, in theory enough AI power could be just as good: 1,000,000 humans could give you the answer to a Postgres query. But Postgres is going to be more efficient.


No, I wouldn’t say that my hypothesis is that non-deterministic behavior is good. It’s an undesirable side effect and illustrates the gap we have between now and the coming post-code world.


AI wouldn't be intelligent, though, if it were deterministic. It would just be information retrieval.


It already is "just" information retrieval, just with stochastic threads refining the geometry of the information.


Haha u mean it isn't AGI? /s


Web apps kind of already do that with most companies shipping constant UX redesigns, A/B tests, new features, etc.

For a typical user today’s software isn’t particularly deterministic. Auto updates mean your software is constantly changing under you.


I don't think that is what the original commenter was getting at. In your case, the company is actively choosing to make changes. Whether its for a good reason, or leads to a good outcome, is beside the point.

LLMs being inherently non-deterministic means using this technology as the foundation of your UI will mean your UI is also non-deterministic. The changes that stem from that are NOT from any active participation of the authors/providers.

This opens a can of worms where there will always be a potential for the LLM to spit out extremely undesirable changes without anyone knowing. Maybe your bank app one day doesn't let you access your money. This is a danger inherent and fundamental to LLMs.


Right, I get that. The point I'm making is that from a user's perspective it's functionally very similar: a non-deterministic LLM, or a non-deterministic company full of designers and engineers.


Regardless of what changes the bank makes, it’s not going to let you access someone else’s money. This llm very well might.


Well, software has been known to have vulnerabilities...

Consider this: the bank teller is non-deterministic, too. They could give you 500 dollars of someone else's money. But they don't, generally.


Bank tellers are deterministic though. They have a set protocol for each case and escalate unknown cases to a more deterministic point of contact.

It would be difficult to incorporate relative access or restrictions to features with respect to a user's current/known state or actions. Might as well write the entire web app at that point.


I think the bank teller's systems and processes are deterministic, but the teller themselves is not. They could even rob the bank if they wanted to. They could shoot the customers. They don't, generally, but they can.

I think, if we can efficiently capture a way to "make" LLMs conform to a set of processes, you can cut out the app and just let the LLM do it. I don't think this makes any sense for maybe the next decade, but perhaps at some point it will. And, in such time, software engineering will no longer exist.


The actual app is the set of processes.


The rate of change is so different it seems absurd to compare the two in that way.

The LLM example gives you a completely different UI on _every_ page load.

That’s very different from companies moving around buttons occasionally and rarely doing full redesigns


And most end users hate it.


I think it's actually conceptually pretty different. LLMs today are usually constrained to:

1. Outputting text (or, sometimes, images).

2. No long term storage except, rarely, closed-source "memory" implementations that just paste stuff into context without much user or LLM control.

This is a really neat glimpse of a future where LLMs can have much richer output and storage. I don't think this is interesting because you can recreate existing apps without coding... But I think it's really interesting as a view of a future with much richer, app-like responses from LLMs, and richer interactions — e.g. rather than needing to format everything as a question, the LLM could generate links that you click on to drill into more information on a subject, which end up querying the LLM itself! And similarly it can ad-hoc manage databases for memory+storage, etc etc.
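As a toy sketch of that "links that query the LLM itself" idea (everything here is illustrative: the llm() helper is hypothetical, not any existing product's API):

    # Python + Flask: every page, including the pages its links
    # point to, is rendered on demand by the model.
    from flask import Flask

    app = Flask(__name__)

    def llm(prompt: str) -> str:
        """Hypothetical helper that sends the prompt to some LLM
        and returns an HTML string."""
        raise NotImplementedError

    @app.route("/", defaults={"topic": "home"})
    @app.route("/<path:topic>")
    def page(topic: str) -> str:
        # Ask the model to answer AND to emit links that route back
        # here, so clicking a link is just another model query.
        return llm(
            f"Render a small HTML page about '{topic}'. "
            "Link any related subtopics as <a href='/SUBTOPIC'>...</a> "
            "so they resolve back to this same endpoint."
        )

Memory/storage could work the same way: hand the model a scratch database it reads and writes between requests.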


Or, maybe, just not use LLMs?

LLM is just one model used in A.I. It's not a panacea.

For generating deterministic output, probably a combination of Neural Networks and Genetic Programming will be better. And probably also much more efficient, energy-wise.


Every time you need a rarely used functionality it might be better to wait 60s for an LLM with MCP tools to do its work than to update an app. It only makes sense to optimize and maintain app functionalities when they are reused.


For some things you absolutely want deterministic behaviour. For other things, behaviour that adapts to usage and the context provided by the data the user provides sounds like it could potentially be very exciting. I'm glade people are exploring this. The hard part will be figuring out where the line goes, and when and how to "freeze" certain behaviours that the user seems happy with vs. continuing to adapt to data.


Like, for sure you can ask the AI to save its "settings" or "context" to a local file in a format of its own choosing, and then bring that back in the next prompt; couple this with temperature 0 and you should get to a fixed-point deterministic app immediately.


There may still be some variance at temperature 0. The outputted code could still have errors. LLMs are still bounded by undecidable problems in computability theory, like Rice's theorem.


Why wouldn't the LLM codify that "context" into code so it doesn't have to rethink it over and over? Just like humans would. Imagine if you were manually operating a website and every time a request came in you had to come up with SQL queries (without remembering how you did it last time) and manually type the responses. You wouldn't last long before you started automating.


> couple this with temperature 0

Not quite the case. Temperature 0 is not the same as random seed. Also there are downsides to lowering temperature (always choosing the most probable next token).


LLMs are easily made deterministic by choosing the selection strategy. More than being deterministic, they are also fully analyzable, and you don't run into issues like the halting problem if you constrain the output appropriately.
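For illustration, a minimal sketch of one deterministic selection strategy (greedy argmax decoding) with a HuggingFace-style causal LM; "gpt2" is just a placeholder model:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(8):
            logits = model(ids).logits[:, -1, :]           # next-token scores
            next_id = logits.argmax(dim=-1, keepdim=True)  # greedy: no sampling
            ids = torch.cat([ids, next_id], dim=-1)
    print(tok.decode(ids[0]))

Same prompt, same weights, same hardware: same output every time. The caveat upthread still applies, though: floating-point non-determinism across batching/hardware can shift logits, so even argmax isn't bulletproof across deployments.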


Why do good thing consistently when we can do great thing that only works sometimes??? :(


Designing a system with deterministic behavior would require the developer to think. Human-Computer Interaction experts agree that a better policy is to "Don't Make Me Think" [1]

[1] https://en.wikipedia.org/wiki/Don%27t_Make_Me_Think


That book is talking about user interaction and application design, not development.

We absolutely should want developers to think.


As experiments like TFA become more common, the argument will shift to whether anybody should think about anything at all.


What argument? I see a business model here, not an argument.


I meant "the discourse", "the conversation we are all having", interpreting the experiment in TFA as an entry in that discourse.


This is such a massive misunderstanding of the book. Have you even read it? The developer needs to think so that the user doesn't have to...


My most charitable interpretation of the perceived misunderstanding is that the intent was to frame developers as "the user."

This project would be the developer tool used to produce interactive tools for end users.

More practically, it just redefines the developer's position; the developer and end-user are both "users". So the developer doesn't need to think AND the user doesn't need to think.


I interpreted it like "why don't we simply eat the orphans"? It kind of works but it's absurd, so it's funny. I didn't think about it too hard though, because I'm on a computer.


..is this an AI comment?


> who wants web apps to behave differently every time you interact with them?

Technically everyone; we stopped using static pages a while ago.

Imagine pages that can now show you e.g. infinitely customizable UI; or, more likely, extremely personalized ads.


Small anecdote: we were releasing UI changes every 2 weeks, making the app better, more user-friendly, etc.

Product owners were happy.

Until users came for us with pitchforks as they didn’t want stuff to change constantly.

We backed off to a monthly release cadence.


No.

When I go to the dmv website to renew my license, I want it to renew my license every single time


Ah, sure; that's why everyone got Adblock and uBO in the first place. Even more so on phones.


> infinitely customizable UI; or, more likely, extremely personalized ads

Yeah, NO.

