Code changes are cheaper to make now, and correspondingly more expensive to verify.
So you can still contribute; you just don't need to provide the code, only the issue.
Which isn't as bad as it sounds. It feels bad to rewrite somebody's code right away when it's theoretically correct, but opinionated codebases seem to work very well as long as the maintainer's opinions are sane.
And what if the maintainer doesn't understand something about how the exploit works? Also, code changes aren't cheaper; it's just that you can watch YouTube instead of putting in effort now. Time still passes, and that costs the same. Reviewing the code is far more expensive now, though, since the LLM won't use libraries.
PS The economics of software haven't really changed; it's just that people (executives) wish they had. They misunderstood the economics of software before LLMs, and they misunderstand them now.
PPS The only people LLMs benefit are the segment of devs who are lazy.
It also lets me see all the relevant associations easily when revealing a card in the built-in SRS. You add cards to the SRS as you browse, so they're related to what you already know or are currently exploring.
Mind you, all the visible data is collected from different reputable sources. When you click "explain" there's a clearly marked LLM explanation, but my explanation-generation pipeline pushes every generated explanation through 5 different models (including all the top Chinese-first ones) for verification, and on average it took a few iterations back and forth to iron out anything that could potentially mislead the learner.
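For what it's worth, the loop itself is simple. Here's a minimal sketch of that kind of multi-model verification pass; the function bodies, model names, and iteration cap are all illustrative assumptions, not the actual pipeline:

```python
# Hypothetical sketch only: generate(), verify(), the model names, and
# MAX_ROUNDS are illustrative stand-ins, not the real pipeline.

MAX_ROUNDS = 4  # assumed cap on back-and-forth iterations

def generate(prompt, feedback=None):
    # Placeholder for the generating model's API call.
    base = f"explanation for {prompt!r}"
    return base if feedback is None else f"{base} (revised: {feedback})"

def verify(explanation, model):
    # Placeholder for one verifier model; returns (ok, feedback).
    # A real implementation would call the model and parse its verdict.
    return True, None

def verified_explanation(prompt, verifiers):
    explanation = generate(prompt)
    for _ in range(MAX_ROUNDS):
        # Collect objections from every verifier that rejects the draft.
        issues = [fb for model in verifiers
                  for ok, fb in [verify(explanation, model)] if not ok]
        if not issues:
            return explanation  # every verifier accepted it
        # Regenerate with the collected objections as feedback.
        explanation = generate(prompt, feedback="; ".join(issues))
    return explanation  # keep the last draft after MAX_ROUNDS

result = verified_explanation("好", ["model-a", "model-b", "model-c",
                                     "model-d", "model-e"])
```

The cost multiplier comes from that inner loop: each rejection triggers a full regeneration plus another round of verifier calls.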
This looks incredible, and exactly like something I've been wanting. Is there the same amount of depth for the 9k+ characters? If this is open source, I'd love to build on it; I was wondering if OP had posted it on GitHub.
Only about 5K explanations for now; I'm still trying to polish the pipeline before covering more. Due to all the verification and the associated regenerations, the cost is quite high.
It's not open source, but it is completely free. Open sourcing is on the table, but right now it would be additional work and a distraction. The licenses alone seem like a headache: from a quick poke, even when data is OK to use non-commercially, redistributing it may not be OK. So not any time soon, but if you're working on something similar I can at least share detailed data sources; finding good ones was not easy, and LLMs integrate them fast.
Mostly datasets, but who knows about the LLM stuff too, since I'm creating the datasets with them[0]. Long story short, it's a potential headache, and I don't want that headache for now. Plus there are basically two options after open sourcing: people don't use it (time and effort wasted), or people do use it and then it's a chore. But it's still on the table. I'm just currently not close to the table.
From my experience, saying "this is not X, it will not be used for Y" vastly increases the chances of it being classified as X. Anybody can write "this is authorized research". Instead use something like "evaluate the security" / "verify the security", "make sure this cannot be (...)", etc.
Of course, these models are pretty smart, so even Anthropic's simple instructions not to provide any exploits stick better and better.
Been using it daily for work and personal stuff since release. The first few weeks were rough, but it's probably the best AI code editor out right now. That's largely due to the models just being superior, though.
Antigravity CLI or the Gemini one? When I tried the latter about 2 months ago it was shockingly bad, though I was a free user. I assume it's better if you're a paying customer?
It doesn't appear that there is an Antigravity CLI, so the latter. I'm using a paid account, though.
For the last few months I was using paid versions of CC, Codex and Gemini CLI, and found them more or less equivalent for my uses. I'm just building web apps though.
I mean, yes, but LLMs have been making me more cognitively active. I've learned how to do more stuff than I would have without them, and it's a decent multiplier, not some rounding error.
Obviously you can have a plumber who knows his stuff and one who doesn't. The good one can check some details and will recognize BS. If you already have the bad one, it's probably better if he uses an LLM than if he doesn't.
Unrelated, but Claude has been performing so tragically the last few days (maybe weeks, but mostly days) that I reluctantly had to switch. Reluctantly, because I enjoy it. Even on the most basic stuff, like Python scripts it has to rerun because of some syntax error.
The new reality of coding took away one of the best things for me: that the computer always just does what it's told. If the results are wrong, it means I'm wrong; I made a bug and I can debug it. Here... I'm not a hater, it's a powerful tool, but... it's different.
I'm not a big user, but I've been doing some vibe-ish coding for a PoC the past few days, and I'm astonished at how bad it is at Python in particular (Opus 4.6 High).
* It likes to put inline imports everywhere, even though I specify in my CLAUDE.md that it should not.
* We use ruff and pyright and require that all problems are addressed, or at least ignored for a good reason, but it straight up #noqa-ignores all issues instead.
* For typing it used the builtin 'any' instead of typing.Any, which is nonsense.
* I asked it to add a simple sum of a column from a related database table, but instead of computing the sum in SQL it did a classic n+1, fetching every single row from the related table and calculating the sum in Python.
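For anyone unfamiliar with the n+1 pattern, here's roughly what the difference looks like. This is a hypothetical sketch using sqlite3 with made-up table names, not the actual codebase:

```python
# Sketch of the n+1 anti-pattern vs. a single aggregated query.
# Table names (orders, line_items) are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE line_items (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO line_items VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# n+1: one query per order, summing row-by-row in Python.
totals_slow = {}
for (order_id,) in conn.execute("SELECT id FROM orders"):
    rows = conn.execute(
        "SELECT amount FROM line_items WHERE order_id = ?", (order_id,)
    ).fetchall()
    totals_slow[order_id] = sum(amount for (amount,) in rows)

# Aggregated: one query, letting the database compute the sums.
totals_fast = dict(conn.execute(
    "SELECT order_id, SUM(amount) FROM line_items GROUP BY order_id"
))

assert totals_slow == totals_fast == {1: 15.0, 2: 7.5}
```

Both produce the same totals, but the first issues one query per order, which is exactly the behavior described above.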
I think the API is fine; likely only subscriptions are affected. Not to mention there are trivial heuristics to differentiate repeated API calls on the same data and potential CLI usage, although that would be true malice.
It seemed to me that it was performing better through opencode using the API, but I didn't test extensively.
If SWE-bench is public, then at a minimum Anthropic is probably also looking at their SWE-bench scores when making changes. I'd put more trust in a tracker that runs a private benchmark not known to Anthropic.
You mean codex (the client) with GPT 5.4 xhigh? I'm using the Codex 5.3 model through Cursor, waiting for the Codex 5.4 model, as I've had a great experience with 5.3 so far.
Yes and no. It's bad because of the shorter context, but its auto-compaction is much better than Claude's. If you provide it documentation to work from and re-reference, it handles long-running work.
Honestly - 'every inch of IQ delta' seems to be worth it over anything else.
I'm a long-time Claude Code supporter, and I'm ashamed to admit how instantly I dropped it when I discovered how much better 5.4 is.
I don't trust Claude anymore for anything that requires heavy thinking; Codex always finds flaws in the logic.
I tried to use 5.4 for something pretty straightforward: creating scripts to automate navigating a game UI and capturing the network traffic. 5.4 was super frustrating, constantly stopping and waiting for feedback, even after I told it to never wait and just iterate/debug. I quit and switched to Opus 4.6, and it did much more of the work by itself.
I've never run into that problem, but those were coding solutions in codex with a strong plan and steps to work towards.
It could be that if you're burning massive tokens on a 'plan' they want to limit you in some way, or that if the objective isn't perfectly clear they don't want semi-random token use.
See if the token/sub solution behaves differently. Make sure that when it 'compacts' it re-reads your instructions clearly.
Well, I wish I could help, but things change so fast. codex with Opus 4.7 is not very strong; you have to set the effort level relatively high, though.
Forget the agent itself being dumber: right now I'm getting an "API error: usage limit exceeded" message whenever I try anything despite my usage showing as 26% for the session limit and 8% for the week (with 0/5 routines, which I guess is what this thread is about). This is with the default model and effort, and Claude Code is saying I need to turn on extra usage for it to work. Forget that, I just canceled my subscription instead.
There's utility in LLMs for coding, but having literally the entire platform vibe-coded is too much for me. At this point, I might genuinely believe they're not intentionally watering anything down, because it's incredibly believable that they just have no clue how any of it works anymore.
Likewise, I foolishly assumed everybody else was just doing it wrong.
But this week I've lost count of the times I've had to say something along the lines of:
"Can you check our plan/instructions, I'm pretty sure I said we need to do [this thing] but you've done [that thing]..."
And get hit with a "You're absolutely right...", which virtually never used to happen to me. I think maybe once since Opus 4.6.
Honestly, I thought it was a skill issue too, but it just turns out I wasn't using it enough.
I started a new job recently, so I'm asking it a lot of questions about the codebase, sometimes just to confirm my understanding, and often it comes up with wrong conclusions that send me down rabbit holes, only for me to find out it was wrong.
On a side project, I literally gave it a formula and told it to run it with some other parameters. It did its usual "let me get to know the codebase" routine, then the "I have a good understanding of the codebase" speech, only to follow that up with "what you're asking is not possible". I'm like... no, I know it's possible, I implemented it already, just use it in more places. Only to get the same "o ye ur right, I missed that... blabla".
They track our frustration, which is probably really good coding data. The reason it's painful is that it's data annotation: literally a job people get paid to do, yet we're paying to do it. If they need good data, they just turn the models to shit and gaslight everyone.
My favourite was Opus 4.6 last night (to be fair, peak IST time, late afternoon my time), on the first prompt with a small context: it jams a copy-pasted function in between a bunch of import statements, doesn't even wire up its own function, and calls it done. Wild; I've not seen failure states like that since old Sonnet 4.
I asked Opus 4.6 to help me get GPU stats in btop on NixOS. Opus's first approach was to use patchelf to monkey-patch the btop binary. I had to redirect it to just look at the nix wiki and add `nixpkgs.config.rocmSupport = true;`.
But the approach of modifying a compiled binary for a configuration issue is bizarre.
It does stuff like this all the time. It loves doing this to scripts with sed, so I'm not surprised to hear about it trying it with binaries. That's definitely wilder, though.
It frequently gets indentation wrong on projects, then tries to write sed/awk scripts. When it can't get those right, it writes a Python script that reformats the whole file on stdout, makes sure the indentation is correct, and then requests an edit snippet.
And you might be thinking: well, you should use a code formatter! But I do!
And then you might say: well, surely you forgot to mention it in your AGENTS/CLAUDE file. Nope, it's there, multiple times even, in different sections, because once was apparently not enough.
And lastly: surely if I'm watching this cursed loop unfold and approving edits manually, like some bogan pleb, I can steer it easily... Well, let me tell ya, I tried stopping it and injecting hints about the formatter, and it sticks for a minute before going crazy again. Or sometimes it rereads the file and immediately fucks up the formatting again.
I think when this shit happens, it probably uses like 3x more tokens.
For a Rust project, it recently started analysing binaries in the target directory as a first instinct, instead of looking at the code...
In my experience Opus and Claude have declined significantly over the past few weeks. It actually feels like dealing with an employee that has become bored and intentionally cuts corners.
Pretty reassuring to hear that. I was skeptical too; there are a lot of variables, like some crap added to memory, a specific skill, or custom instructions interfering with the workflow, and whatnot. But this time it was like a toddler that consumes money while talking.
Is it? Or is it the task you're trying to do? Opus 4.6 has been staggeringly good for me this last week, both inside Claude Code and through Antigravity until I used up my quota.
I think some of this comes down to undeclared A/B testing. I've had the worst week of interactions I've ever had using Claude Code. All week, whenever I have a session that isn't failing miserably I seem to get tapped for a session survey, but on any that are out-and-out shitting the bed it never asks. It has felt a little surreal. I'd love to see a product-wide stats graph for swearing; I would 100% believe it's hitting an all-time high, but maybe I'm just a victim of a bad A/B round.
Oh I’ve been getting a lot more of those too lately even though I dismiss it every time. Wonder if I should report not satisfied every time so that I get routed to something better…
I have the same bias as the parent. I'd rather pay $50 one time than $9 a year even if I throw it away after 4 years.
But the main reason I wouldn't install it despite being happy customizing linux is that it's yet another black box I need to trust and that knows way too much. It's really insane how much you need to compromise your security on macos to have a decent developer experience.
To be fair, I can do 3 balls effortlessly, but I can't do 1 ball the way it's in this description; I just have a lot of error correction, enough to keep it going pretty much indefinitely. But I cannot reliably throw it accurately to the other hand.
You just convinced me to try it. Claude just copy-pastes, does search and replace, and uses zero abstractions, and I'm the one who needs to think about the edge cases.
You may think that's a good thing, but it's not. Codex is great at coming up with solutions to problems that don't exist and failing to find solutions to problems that do. In the end you have 300 new lines of code and nothing to show for it.