- User is a bit of a dick (bad)
- Engineer attempts to defuse situation (okay)
- User expands (good)
- CEO escalates situation (terrible)
Aiden definitely didn't begin the interaction the right way, but it's also taking place over Twitter, and that platform hardly encourages refined points (would anyone have responded if his second response had been his first?). The engineer got things going in the right direction, but then the CEO turned it all around and made it far worse than had they just let Aiden yell into the void. It screamed arrogance and a disconnect from the users. Sorry, but the number of users a product has often doesn't correlate with its quality.
You also need to consider expectation and responsibility. Unfortunately there's no expectation or responsibility for a user to be well behaved, but that's not true for a business, and especially a CEO. Yes, you can say it's unfair that responsibility doesn't go both ways, but also recognize that there's a vastly different power dynamic.
I think an important thing to add is that users don't always know how to properly complain. So a difficulty is figuring out what they actually want. They're on the outside looking in, so they don't know all the details, but they can express that they have a problem. It can often be hard, and frustrating, to figure out what that problem actually is, but if they're communicating then it's usually not too difficult to defuse the situation. As long as they feel you are trying to understand.
Another part is that we're breeding a society of Karens. "The squeaky wheel gets the grease": the wheels that aren't squeaking don't get regular maintenance or care. No one is incentivized to ask nicely, but people are being strongly incentivized to scream. To generalize outside software: a loyal customer gets standard service, but Karen gets a discount or something free just to make her go away. It's natural that we do that, but it's the wrong reward system. If you reward a dog when it stops barking, it only learns to bark.
Agreed, I'm always trying to improve my communication skills and I think it's actually the core difficulty of modern society - as, honestly, it has been since Socrates talked about what we would now call existential loneliness.
I feel like "abstraction" is overloaded in many conversations.
Personally I love abstraction when it means "generalize these routines to a simple and elegant version". Even if it's harder to understand than a single instance, it's worth the investment and gives a far better understanding of the code and what it's doing.
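To make that concrete, here's a toy sketch of the kind of thing I mean (all names are made up): three near-duplicate loaders collapsed into one parameterized routine, where each caller states its intent instead of copy-pasting the parsing loop.

```python
# Toy sketch (all names made up): replaces load_sales_report,
# load_inventory_report, and load_returns_report with one routine.
def load_report(path: str, *, delimiter: str = ",", skip_header: bool = True) -> list[list[str]]:
    """Read a delimited text file into rows of fields."""
    with open(path) as f:
        rows = [line.rstrip("\n").split(delimiter) for line in f]
    return rows[1:] if skip_header else rows

# Callers now say what they want rather than how to parse it:
# sales = load_report("sales.csv")
# inventory = load_report("inventory.tsv", delimiter="\t")
# returns = load_report("returns.csv", skip_header=False)
```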
But there's also abstraction in the sense of making things less understandable or more complex, and I think LLMs operate this way. It takes a long time to understand code, not because any single line is harder to understand, but because the lines need to be understood in context.
I think part of this is people misunderstanding elegance. It doesn't mean aesthetically pleasing; it means doing something in a simple and efficient way. Yes, write it rough the first round, but we should also strive for elegance. It seems more like we're just trying to get the first rough draft out and move on to the next thing.
What's concerning to many of us is that you (and others) have said this same thing, s/Opus 4.5/some other model/.
That feels more like chasing than a clear line of improvement. It's interpreted very differently from something like "my habits have changed quite a bit since reading The Art of Computer Programming". They're categorically different.
It's because the models keep getting better! What you could do with GPT-4 was more impressive than what you could do with GPT 3.5. What you could do with Sonnet 3.5 was more impressive yet, and Sonnet 4, and Sonnet 4.5.
Some of these improvements have been minor, some of them have been big enough to feel like step changes. Sonnet 3.7 + Claude Code (they came out at the same time) was a big step change; Opus 4.5 similarly feels like a big step change.
If you're sincerely trying these models out with the intention of seeing if you can make them work for you, and doing all the things you should do in those cases, then even if you're getting negative results somehow, you need to keep trying, because there will come a point where the negative turns positive for you.
If you're someone who's been using them productively for a while now, you need to keep changing how you use them, because what used to work is no longer optimal.
Models keep getting better but the argument I'm critiquing stays the same.
So does the comment I critiqued in the sibling comment to yours. I don't know why it's so hard to believe we just haven't tried. I have a Claude subscription. I'm an ML researcher myself. Trust me, I do try.
But that last part also makes me keenly aware of their limitations and failures. Frankly, I don't trust experts who aren't critiquing their own field. Leave the selling points to the marketing team. The engineer's and researcher's job is to be critical, to find problems. I mean, how the hell do you solve problems if you're unable to identify them lol. Let the marketing team lead development direction instead? Sounds like a bad way to solve problems.
> benchmark shows huge improvements
Benchmarks are often difficult to interpret. It is really problematic that they got incorporated into marketing. If you don't understand what a benchmark measures, and more importantly, what it doesn't measure, then I promise you that you're misunderstanding what those numbers mean.
For METR, I think they say a lot right here (emphasis my own) that reinforces my point:
> Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most *exam-style problems* for a fraction of the cost. ... And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. *They are unable to reliably handle even relatively low-skill*, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in some sense, but it is unclear how this corresponds to real-world impact.
So make sure you're really careful to understand what is being measured. What improvement actually means. To understand the bounds.
It's great that they include longer tasks, but also notice the biases and the distribution of the human workers; that matters for evaluating the results properly.
Also remember what exactly I quoted. For a long time we've all known that being good at leetcode doesn't make someone a good engineer. But it's an easy thing to test, and the test correlates with other skills people tend to pick up on the way to getting good at it (even though those tests can be metric hacked). We're talking about massive compression machines that pattern match. Pattern matching tends to get much harder as task length increases, but that's not a necessary condition.
Treat every benchmark adversarially. If you can't figure out how to metric hack it, then you don't know what the benchmark is measuring (and just because you know how to hack it doesn't mean you understand it, nor that that's what's actually being measured).
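To make the metric-hack point concrete, here's a toy, entirely hypothetical illustration: on an imbalanced benchmark, a "model" that never looks at its input still posts an impressive-sounding accuracy number.

```python
# Toy metric hack: on a 95/5 imbalanced benchmark, a "model" that ignores
# its input entirely still scores 95% accuracy.
labels = [0] * 95 + [1] * 5           # hypothetical benchmark labels

def always_majority(_example) -> int:
    return 0                          # never even looks at the example

predictions = [always_majority(x) for x in labels]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"accuracy = {accuracy:.0%}")   # accuracy = 95%
```

The headline number is real; the capability behind it isn't. That's the sense in which knowing how a benchmark can be gamed tells you what it does and doesn't measure.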
I think you should ask yourself: If it were true that 1) these things do in fact work, 2) these things are in fact getting better... what would people be saying?
The answer is: Exactly what we are saying. This is also why people keep suggesting that you need to try them out with a more open mind, or with different techniques: Because we know with absolute first-person iron-clad certainty what is possible, and if you don't think it's possible, you're missing something.
It seems to be "people keep saying the models are good"?
That's true. They are.
And the reason people keep saying it is because the frontier of what they do keeps getting pushed back.
Actual, working, useful code completion in the GPT 4 days? Amazing! It could automatically write entire functions for me!
The ability to write whole classes and utility programs in the Claude 3.5 days? Amazing! This is like having a junior programmer!
And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!
But now we are beginning to see that programming in 6 months' time might look very different to now, because these AI systems code very differently to us. That's exactly the point.
So what is it you are arguing against?
I think you said you didn't like that people are saying the same thing, but in this post it seems more complicated?
Is there an endpoint for AI improvement? If we can go from functions to classes to substantial programs then it seems like just a few more steps to rewriting whole software products and putting a lot of existing companies out of business.
"AI, I don't like paying for my SAP license, make me a clone with just the features I need".
> And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!
People have been doing this parlor trick with various "substantial" programs [1] since GPT 3. And no, the models aren't better today, unless you're talking about being better at the same kinds of programs.
[1] If I have to see one more half-baked demo of a running game or a flight sim...
It’s a vague statement that I obviously cannot defend in all interpretations, but what I mean is: the performance of models at making non-trivial applications end-to-end, today, is not practically better than it was a few years ago. They’re (probably) better at making toys or one-shotting simple stuff, and they can definitely (sometimes) crank out shitty code for bigger apps that “works”, but they’re just as terrible as ever if you actually understand what quality looks like and care to keep your code from descending into entropy.
I think "substantial" is doing a lot of heavy lifting in the sentence I quoted. For example, I’m not going to argue that aspects of the process haven’t improved, or that Claude 4.5 isn't better than GPT 4 at coding, but I still can’t trust any of the things to work on any modestly complex codebase without close supervision, and that is what I understood the broad argument to be about. It's completely irrelevant to me if they slay the benchmarks or make killer one-shot N-body demos, and it's marginally relevant that they have better context windows or now hallucinate 10% less often (in that they're more useful as tools, which I don't dispute at all), but if you want to claim that they're suddenly super-capable robot engineers that I can throw at any "substantial" problem, you have to bring evidence, because that's a claim that defies my day-to-day experience. They're just constantly so full of shit, and that hasn't changed, at all.
FWIW, this line of argument usually turns into a motte-and-bailey fallacy, where someone makes an outrageous claim (e.g. "models have recently gained the ability to operate independently as a senior engineer!"), and when challenged on the hyperbole, retreats to a more reasonable position ("Claude 4.5 is clearly better than GPT 3!"), but with the speculative caveat that "we don't know where things will be in N years". I'm not interested in that kind of speculation.
Opus 4.5 is categorically a much better model, from benchmarks and personal experience, than Opus 4.1 and the Sonnet models. The reason you're seeing a lot of people wax lyrical about O4.5 is that it was a real step change in reliable performance. For me it crossed a critical threshold: being able to solve problems by approaching them in systematic ways.
Why do you use the word "chasing" to describe this? I don't understand. Maybe you should try it and compare it to earlier models to see what people mean.
> Why do you use the word "chasing" to describe this?
I think you'll get the answer to this if you read my comment and your response to understand why you didn't address mine.
Btw, I have tried it. It's annoying that people think the problem is not trying. It was getting old when GPT 3.5 came out. Let's update the argument...
> the solution to uncertainty is to pile abstraction on top of abstraction until no one can explain what’s actually happening anymore.
I've usually found complaints about abstraction in programming odd because, frankly, all we do is abstraction. It often seems to be used to mean "/I/ don't understand this, therefore we should do something more complicated, with many more lines of code, that's less flexible."
But this usage? I'm fully on board. Too much abstraction is when it's incomprehensible. To whom is the next question (my usual complaint is that the bar shouldn't be set at the junior level), and I think you're right to point out that the "who" here is everyone.
We're killing a whole side of creativity and elegance while only slightly aiding another side. There's utility to this, but also a cost.
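To make the "/I/ don't understand, so pile on layers" flavor concrete, here's a toy sketch (all names made up) where a one-line operation gets buried under indirection that adds lines without adding flexibility.

```python
# Over-abstracted version: more lines, more moving parts, nothing gained.
class GreetingStrategy:
    def render(self, name: str) -> str:
        raise NotImplementedError

class DefaultGreetingStrategy(GreetingStrategy):
    def render(self, name: str) -> str:
        return f"Hello, {name}!"

class GreetingService:
    def __init__(self, strategy: GreetingStrategy | None = None):
        self._strategy = strategy or DefaultGreetingStrategy()

    def greet(self, name: str) -> str:
        return self._strategy.render(name)

# Versus the version every reader already understands:
def greet(name: str) -> str:
    return f"Hello, {name}!"
```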
I think what frustrates me most about CS is that as a community we tend to go all in on something. We went all in on VR, then crypto, and now AI. We should be trying new things, but it feels more like we take these sides as if they're objective, and anyone not hopping on the hype train is an idiot or a luddite. The way the whole industry jumps to these things feels more like FOMO than intelligent strategy. Like making a sparkling water company an "AI first" company... it's like we love solutions looking for problems.
If you're getting flickers, that's probably the best thing to do. It will cut your total power, but you'll get a smoother signal. If the flicker is visible, then it's at a relatively low frequency, but you might want to play around with it. The easiest thing to test (because you're more likely to have the parts on hand) is a pi filter; if that seems to work you can either stick with it or move to a better filter.
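For a rough sense of where a single LC low-pass section puts its corner, here's a back-of-the-envelope sketch; the component values are placeholders I picked for illustration, not a recommendation, and the pi filter's second capacitor and your load will shift things somewhat. You generally want the corner well below the ripple frequency you're trying to smooth out.

```python
# Back-of-the-envelope corner frequency of a single LC low-pass section:
# f_c = 1 / (2 * pi * sqrt(L * C)). Placeholder values, not a recommendation.
import math

L = 100e-3    # 100 mH choke (assumed)
C = 2200e-6   # 2200 uF capacitor (assumed)

f_c = 1 / (2 * math.pi * math.sqrt(L * C))
print(f"corner frequency ~ {f_c:.1f} Hz")   # ~ 10.7 Hz for these values
```

With a corner around 10 Hz, 100/120 Hz ripple gets knocked down by very roughly two orders of magnitude (second-order rolloff, load-dependent), which is usually enough to stop visible flicker.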