Code changes are cheaper to make now, and correspondingly more expensive to verify.
So you can still contribute; you just don't need to provide the code, only the issue.
Which isn't as bad as it sounds. It feels bad to rewrite somebody's code right away when it's theoretically correct, but opinionated codebases seem to work very well as long as the maintainer's opinions are sane.
And what if the maintainer doesn't understand something about how the exploit works? Also, code changes aren't cheaper; it's just that you can watch YouTube instead of putting in effort now. Time still passes, and that costs the same. Reviewing the code is far more expensive now, though, since the LLM won't use libraries.
PS The economics of software haven't really changed; it's just that people (executives) wish they had. They misunderstood the economics of software before LLMs, and they misunderstand them now.
PPS The only people LLMs benefit are the segment of devs who are lazy.
It also lets me see all the relevant associations easily when revealing a card in the built-in SRS. You add cards to the SRS as you browse, so they're related to what you already know or are currently exploring.
Mind you, all the visible data is collected from different reputable sources. When you click "explain" there's a clearly marked LLM explanation, but my explanation-generation pipeline pushes every generated explanation through 5 different models (including all the top Chinese-first ones) for verification, and on average it took a few iterations back and forth to iron out anything that could potentially mislead the learner.
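For what it's worth, the loop itself is simple. Here's a minimal sketch of that kind of multi-model verification pass; the function bodies, model names, and iteration cap are all illustrative assumptions, not the actual pipeline:

```python
# Hypothetical sketch only: generate(), verify(), the model names, and
# MAX_ROUNDS are illustrative stand-ins, not the real pipeline.

MAX_ROUNDS = 4  # assumed cap on back-and-forth iterations

def generate(prompt, feedback=None):
    # Placeholder for the generating model's API call.
    base = f"explanation for {prompt!r}"
    return base if feedback is None else f"{base} (revised: {feedback})"

def verify(explanation, model):
    # Placeholder for one verifier model; returns (ok, feedback).
    # A real implementation would call the model and parse its verdict.
    return True, None

def verified_explanation(prompt, verifiers):
    explanation = generate(prompt)
    for _ in range(MAX_ROUNDS):
        # Collect objections from every verifier that rejects the draft.
        issues = [fb for model in verifiers
                  for ok, fb in [verify(explanation, model)] if not ok]
        if not issues:
            return explanation  # every verifier accepted it
        # Regenerate with the collected objections as feedback.
        explanation = generate(prompt, feedback="; ".join(issues))
    return explanation  # keep the last draft after MAX_ROUNDS

result = verified_explanation("好", ["model-a", "model-b", "model-c",
                                     "model-d", "model-e"])
```

The cost multiplier comes from that inner loop: each rejection triggers a full regeneration plus another round of verifier calls.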
This looks incredible, and exactly like something I've been wanting. Is there the same amount of depth for the 9k+ characters? If this is open source, I'd love to build on it; I was wondering if OP had posted it on GitHub.
Only about 5K explanations for now; I'm still trying to polish the pipeline before covering more. Due to all the verification and the associated regenerations, the cost is quite high.
It's not open source, but it is completely free. Open sourcing is on the table, but right now it would be additional work and a distraction. The licenses alone seem like a headache: from a quick poke, even when data is OK to use non-commercially, redistributing it may not be OK. So not any time soon, but if you're working on something similar I can at least share detailed data sources; finding good ones was not easy, and LLMs integrate them fast.
Mostly datasets, but who knows about the LLM stuff too, since I'm creating the datasets with them[0]. Long story short, it's a potential headache, and I don't want that headache for now. Plus there are basically two options after open sourcing: people don't use it (time and effort wasted), or people do use it and then it's a chore. But it's still on the table. I'm just currently not close to the table.
From my experience, saying "this is not X, it will not be used for Y" vastly increases the chances of it being classified as X. Anybody can write "this is authorized research". Instead use something like "evaluate the security" / "verify the security", "make sure this cannot be (...)", etc.
Of course, these models are pretty smart, so even Anthropic's simple instructions not to provide any exploits stick better and better.
Been using it daily for work and personal stuff since release. The first few weeks were rough, but it's probably the best AI code editor out right now. That's largely due to the models just being superior, though.
Antigravity CLI or the Gemini one? When I tried the latter about 2 months ago it was shockingly bad, though I was a free user. I assume it's better if you're a paying customer?
It doesn't appear that there is an Antigravity CLI, so the latter. I'm using a paid account, though.
For the last few months I was using paid versions of CC, Codex and Gemini CLI, and found them more or less equivalent for my uses. I'm just building web apps though.
I mean, yes, but LLMs have been making me more cognitively active. I've learned how to do more stuff than I would have without them, and it's a decent multiplier, not some rounding error.
Obviously you can have a plumber who knows his stuff and one who doesn't. The good one can check some details and will recognize BS. If you already have the bad one, it's probably better if he uses an LLM than if he doesn't.
Unrelated, but Claude has been performing so tragically the last few days (maybe weeks, but mostly days) that I reluctantly had to switch. Reluctantly, because I enjoy it. Even on the most basic stuff, like Python scripts it has to rerun because of some syntax error.
The new reality of coding took away one of the best things for me: that the computer always just does what it's told. If the results are wrong, it means I'm wrong; I made a bug and I can debug it. Here... I'm not a hater, it's a powerful tool, but... it's different.
I'm not a big user, but I've been doing some vibe-ish coding for a PoC the past few days, and I'm astonished at how bad it is at Python in particular (Opus 4.6 High).
* It likes to put inline imports everywhere, even though I specify in my CLAUDE.md that it should not.
* We use ruff and pyright and require that all problems are addressed, or at least ignored for a good reason, but it straight up #noqa-ignores all issues instead.
* For typing it used the builtin 'any' instead of typing.Any, which is nonsense.
* I asked it to add a simple sum of a column from a related database table, but instead of computing the sum in SQL it did a classic n+1, fetching every single row from the related table and calculating the sum in Python.
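For anyone unfamiliar with the n+1 pattern, here's roughly what the difference looks like. This is a hypothetical sketch using sqlite3 with made-up table names, not the actual codebase:

```python
# Sketch of the n+1 anti-pattern vs. a single aggregated query.
# Table names (orders, line_items) are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE line_items (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO line_items VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# n+1: one query per order, summing row-by-row in Python.
totals_slow = {}
for (order_id,) in conn.execute("SELECT id FROM orders"):
    rows = conn.execute(
        "SELECT amount FROM line_items WHERE order_id = ?", (order_id,)
    ).fetchall()
    totals_slow[order_id] = sum(amount for (amount,) in rows)

# Aggregated: one query, letting the database compute the sums.
totals_fast = dict(conn.execute(
    "SELECT order_id, SUM(amount) FROM line_items GROUP BY order_id"
))

assert totals_slow == totals_fast == {1: 15.0, 2: 7.5}
```

Both produce the same totals, but the first issues one query per order, which is exactly the behavior described above.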
I think the API is fine; likely only subscriptions are affected. Not to mention there are trivial heuristics to differentiate repeated API calls on the same data and potential CLI usage, although that would be true malice.
It seemed to me that it was performing better through opencode using the API, but I didn't test extensively.
If SWE-bench is public, then at a minimum Anthropic is probably also looking at their SWE-bench scores when making changes. I'd put more trust in a tracker that runs a private benchmark not known to Anthropic.
You mean codex (the client) with GPT 5.4 xhigh? I'm using the Codex 5.3 model through Cursor, waiting for the Codex 5.4 model, as I've had a great experience with 5.3 so far.
Yes and no. It's bad because of the shorter context, but its auto-compaction is much better than Claude's. If you provide it documentation to work from and re-reference, it handles long-running work.
Honestly - 'every inch of IQ delta' seems to be worth it over anything else.
I'm a long-time Claude Code supporter, and I'm ashamed to admit how instantly I dropped it when I discovered how much better 5.4 is.
I don't trust Claude anymore for anything that requires heavy thinking; Codex always finds flaws in the logic.
I tried to use 5.4 for something pretty straightforward: creating scripts to automate navigating a game UI and capturing the network traffic. 5.4 was super frustrating, constantly stopping and waiting for feedback, even after I told it to never wait and just iterate/debug. I quit and switched to Opus 4.6, and it did much more of the work by itself.
I've never run into that problem, but those were coding solutions in codex with a strong plan and steps to work towards.
It could be that if you're burning massive tokens on a 'plan' they want to limit you in some way, or that if the objective isn't perfectly clear they don't want semi-random token use.
See if the token/sub solution behaves differently. Make sure that when it 'compacts' it re-reads your instructions clearly.
Well, I wish I could help, but things change so fast. codex with Opus 4.7 is not very strong; you have to set the effort level relatively high, though.
Forget the agent itself being dumber: right now I'm getting an "API error: usage limit exceeded" message whenever I try anything despite my usage showing as 26% for the session limit and 8% for the week (with 0/5 routines, which I guess is what this thread is about). This is with the default model and effort, and Claude Code is saying I need to turn on extra usage for it to work. Forget that, I just canceled my subscription instead.
There's utility in LLMs for coding, but having literally the entire platform vibe-coded is too much for me. At this point, I might genuinely believe they're not intentionally watering anything down, because it's incredibly believable that they just have no clue how any of it works anymore.
Likewise, I foolishly assumed everybody else was just doing it wrong.
But this week I've lost count of the times I've had to say something along the lines of:
"Can you check our plan/instructions, I'm pretty sure I said we need to do [this thing] but you've done [that thing]..."
And get hit with a "You're absolutely right...", which virtually never used to happen to me. I think maybe once since Opus 4.6.
Honestly, I thought it was a skill issue too, but it just turns out I wasn't using it enough.
I started a new job recently, so I'm asking it a lot of questions about the codebase, sometimes just to confirm my understanding, and often it comes up with wrong conclusions that send me down rabbit holes, only for me to find out it was wrong.
On a side project, I literally gave it a formula and told it to run it with some other parameters. It did its usual "let me get to know the codebase" routine, then the "I have a good understanding of the codebase" speech, only to follow that up with "what you're asking is not possible". I'm like... no, I know it's possible, I implemented it already, just use it in more places. Only to get the same "o ye ur right, I missed that... blabla".
They track our frustration, which is probably really good coding data. The reason it's painful is that it's data annotation: literally a job people get paid to do, yet we're paying to do it. If they need good data, they just turn the models to shit and gaslight everyone.
My favourite was Opus 4.6 last night (to be fair, peak IST time, late afternoon my time), on the first prompt with a small context: it jams a copy-pasted function in between a bunch of import statements, doesn't even wire up its own function, and calls it done. Wild; I've not seen failure states like that since old Sonnet 4.
I asked Opus 4.6 to help me get GPU stats in btop on NixOS. Opus's first approach was to use patchelf to monkey-patch the btop binary. I had to redirect it to just look at the nix wiki and add `nixpkgs.config.rocmSupport = true;`.
But the approach of modifying a compiled binary for a configuration issue is bizarre.
It does stuff like this all the time. It loves doing this to scripts with sed, so I'm not surprised to hear about it trying it with binaries. That's definitely wilder, though.
It frequently gets indentation wrong on projects, then tries to write sed/awk scripts. When it can't get those right, it writes a Python script that reformats the whole file on stdout, makes sure the indentation is correct, and then requests an edit snippet.
And you might be thinking: well, you should use a code formatter! But I do!
And then you might say: well, surely you forgot to mention it in your AGENTS/CLAUDE file. Nope, it's there, multiple times even, in different sections, because once was apparently not enough.
And lastly: surely if I'm watching this cursed loop unfold and approving edits manually, like some bogan pleb, I can steer it easily... Well, let me tell ya, I tried stopping it and injecting hints about the formatter, and it sticks for a minute before going crazy again. Or sometimes it rereads the file and immediately fucks up the formatting again.
I think when this shit happens, it probably uses like 3x more tokens.
For a Rust project, it recently started analysing binaries in the target directory as a first instinct, instead of looking at the code...
In my experience Opus and Claude have declined significantly over the past few weeks. It actually feels like dealing with an employee that has become bored and intentionally cuts corners.
Pretty reassuring to hear that. I was skeptical too; there are a lot of variables, like some crap added to memory, a specific skill, or custom instructions interfering with the workflow, and whatnot. But this time it was like a toddler that consumes money while talking.
Is it? Or is it the task you're trying to do? Opus 4.6 has been staggeringly good for me this last week, both inside Claude Code and through Antigravity until I used up my quota.
I think some of this comes down to undeclared A/B testing. I've had the worst week of interactions I've ever had using Claude Code. All week, whenever I have a session that isn't failing miserably I seem to get tapped for a session survey, but on any that are out-and-out shitting the bed it never asks. It has felt a little surreal. I'd love to see a product-wide stats graph for swearing; I would 100% believe it's hitting an all-time high, but maybe I'm just a victim of a bad A/B round.
Oh I’ve been getting a lot more of those too lately even though I dismiss it every time. Wonder if I should report not satisfied every time so that I get routed to something better…
I have the same bias as the parent. I'd rather pay $50 one time than $9 a year even if I throw it away after 4 years.
But the main reason I wouldn't install it despite being happy customizing linux is that it's yet another black box I need to trust and that knows way too much. It's really insane how much you need to compromise your security on macos to have a decent developer experience.
To be fair, I can do 3 balls effortlessly, but I can't do 1 ball the way it's in this description; I just have a lot of error correction, enough to keep it going pretty much indefinitely. But I cannot reliably throw it accurately to the other hand.
You just convinced me to try it. Claude just copy-pastes, does search and replace, and uses zero abstractions, and I'm the one who needs to think about the edge cases.
You may think that's a good thing, but it's not. Codex is great at coming up with solutions to problems that don't exist and failing to find solutions to problems that do. In the end you have 300 new lines of code and nothing to show for it.