Welp. I already added a $20 Claude Pro subscription to complement my $10 Github Copilot Pro subscription and $10 DuckDuckGo Plus. That was partly to show support for Anthropic after the OpenAI/DOD episode, but also because I've been using Opus 4.5 exclusively with Copilot and I figured I should try Claude Code eventually.
Now it's going to cost me an upgrade to $39 Github Pro+ to keep using Opus, and even then it's with much higher multipliers. I don't fully understand the extent to which this reflects actual costs for Opus versus Microsoft leveraging network effects to discourage the usage of a competitor.
I didn't really want to wander outside of VSCode just yet because I was happy with VSCode/Copilot/Opus-4.5 and I don't want to spend all my time experimenting when stuff is changing so fast. But I guess my hand has been forced.
44% of uploads are probably not created by 44% of "artists". The core of people looking to exploit the system are going to be good at gaming the recommendation algorithm: they're specialists who are in it solely for the money and don't need to trouble themselves with artistic concerns.
I'm not saying it's impossible, but at a minimum it's extremely hard to game the recommendation algorithm (primarily talking about Spotify, maybe Deezer's is less sophisticated). The best way to "game" the recommendation algorithm, to kickstart a new/less-established artist profile, is to get onto popular playlists. However these playlists either have actual quality barriers (so they won't put AI slop music on) or they take $$ (so this doesn't really work with the "mass generated AI slop" approach).
> I recently deleted a whole bunch of automated tests because if the AI is going to write most of the code then I should test it to make sure it's good!
??
You say you deleted the tests because you "should test it"? The logic seems inconsistent.
Sanity checking LLM-generated code with LLM-generated automated tests is low-cost and high-yield because LLMs are really good at writing tests.
I think LLMs are really bad at writing tests. In the good old days you invested in making your test code structured and understandable. Now we all just say "test this thing you just generated".
I shipped a really embarrassing off-by-one error recently because some polygon representations repeat the first vertex at the end of the ring as a closing sentinel (WKT and KML do this). When I checked the "tests", there was a generated test that asserted that a square has 5 vertices.
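For anyone who hasn't hit this before, here's a quick illustrative sketch (made-up names, not my actual code): in WKT and KML a polygon ring repeats its first coordinate as its last, so a square's ring carries five coordinate pairs but only four vertices, which is exactly the distinction the generated test glossed over.

    # Illustrative only: a minimal WKT ring parser showing the closing-vertex gotcha.
    def parse_wkt_ring(wkt):
        # "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))" -> [(0.0, 0.0), (1.0, 0.0), ...]
        inner = wkt[wkt.index("((") + 2 : wkt.index("))")]
        return [tuple(map(float, pair.split())) for pair in inner.split(",")]

    ring = parse_wkt_ring("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")

    len(ring)        # 5: raw coordinate pairs, including the repeated closing point
    len(ring[:-1])   # 4: the actual vertices of the square

    # The generated test effectively asserted the first number where it should
    # have asserted the second.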
I suppose that my generalization was too broad and that LLMs can be either good or bad at writing tests depending on your workflow and expectations.
I'm closely supervising the LLM, giving it fine-grained instructions — I generally understand the full interface design and most times the whole implementation (though sometimes I skim). When I have the LLM write unit tests for me, it writes essentially what I would have written a couple years ago, except that it tends to be more thorough and add a few more tests I wouldn't have had the patience to write. That saves me quite a bit of time, and the LLM-generated unit tests are probably somewhat better than what I would have written myself.
I won't say that I never see brain-dead mistakes of the "5-vertex square" variety (haha); by their nature, LLMs tend towards consistency rather than understanding, after all. But I've been using Claude Opus exclusively for a while now, and it doesn't tend to make those mistakes nearly as often as I used to see with lower-powered LLMs.
Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
I don't think there's a bias here. I'd say my task is of somewhat high complexity. I'm using Claude to assist me in implementing exceptions in my programming language. It's a SICP chapter 5.4 level task. There are quite a few moving parts in this thing. Opus 4.6 once went around in circles for half an hour trying to trace my interpreter's evaluator. As a human, it's not an easy task for me to do either.
I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results. But then what do they do when they release 4.7?
The problem I described occurred on Claude Code, Opus 4.7/1M, max effort, patched system prompts with all "don't think for simple stuff" instructions removed as well as CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 even though Opus 4.7 ignores it.
For my working style (fine-grained instructions to the agent), Opus 4.5 is basically ideal. Opus 4.6 and 4.7 seem optimized for more long-running tasks with less back and forth between human and agent; but for me Opus 4.6 was a regression, and it seems like Opus 4.7 will be another.
This gives me hope that even if future versions of Opus continue to target long-running tasks, getting more and more expensive while becoming less and less appropriate for my style, a competitor can build a model akin to Opus 4.5 that suits my workflow while optimizing for other factors like cost.
For now, I'm planning to stick with Opus 4.5 as a driver in VSCode Copilot.
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance, and, if the 4.5-to-4.6 change is any guide, more overthinking targeted at long-running tasks rather than fine-grained work. For me, that seems like a step backwards.
I've used Sonnet a lot. It is not as good as Opus at understanding what I'm asking for. I have to coach Sonnet more closely, taking more care to be precise in my prompts, and often building up Plan steps when I could just YOLO an Agent instruction at Opus and it would get it right.
I find that Opus is really good at discerning what I mean, even when I don't state it very clearly. Sonnet often doesn't quite get where I'm going and it sometimes builds things that don't make sense. Sonnet also occasionally makes outright mistakes, like not catching every location that needs to be changed; Opus makes nearly every code change flawlessly, as if it's thinking through "what could go wrong" like a good engineer would.
Sonnet is still better than older and/or less-capable models like GPT 4.1, Raptor mini (Preview), or GPT-5 mini, which all fail in the same way as Sonnet but more dramatically... but Opus is much better than Sonnet.
Recent full-powered GPTs (including the Codex variants) are competitive with Opus 4.6, but Opus 4.5 in particular is best in class for my workflow. I speculate that Opus 4.5 dedicates the most cycles out of all models to checking its work and ensuring correctness — as opposed to reaching for the skies to chase ambitious, highly complex coding tasks.
Hmm, maybe they're discouraging Copilot+Claude through pricing, nudging people toward Anthropic's suite of tools. That sucks. I've been super happy with Copilot+Opus/Sonnet.
Cassettes are a pain. Head alignment is extremely important for analog tape fidelity, and it's always off for home recordings.
With pro analog tape recordings (e.g. 2-inch 24 track, half-inch 2-track), you record alignment tones onto the tapes to capture the state of the recording device, and then later calibrate the playback device to the particular tape so that playback alignment matches recording alignment. But this is essentially never done with cassettes, so you have to earball it.
Cassette players for mastering studios actually have alignment options (e.g. adjustable azimuth) that aren't present on consumer devices. But without the tones, you have to guess.
The problem with starting from a digitized source is that it may have been digitized from non-aligned playback. Ideally you want to go back to the analog originals - but old cassettes are rarely in perfect condition.
Interestingly, the Nakamichi Dragon is/was a cassette deck that can do automatic azimuth adjustment on playback -- without having recorded tones to work with.
In loose terms: it does this with a special read head that splits one of the recorded tracks into 2 distinct signals (for a total of 3 signals from 1 stereo recording). The split track's two signals are compared, and the deck adjusts the azimuth (by minutely rotating the head) until they match as closely as possible.
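If a digital analogy helps, here's a loose sketch of that principle (the Dragon itself does this with an analog servo and a dedicated head; read_split_track and the angle sweep below are made up for illustration). Azimuth error shows up as the two halves of the split track drifting out of phase, so you pick whichever head angle makes them line up best:

    import numpy as np

    # Illustrative sketch of closed-loop azimuth correction, not the Dragon's circuit.
    def alignment_score(upper, lower):
        # Normalized zero-lag correlation: closer to 1.0 means the two half-track
        # signals are in phase, i.e. the head azimuth matches the recording.
        return float(np.dot(upper, lower) /
                     (np.linalg.norm(upper) * np.linalg.norm(lower)))

    def auto_azimuth(read_split_track, candidate_angles):
        # read_split_track(angle) -> (upper_half, lower_half) samples with the head
        # tilted to that angle (hypothetical hardware callback).
        return max(candidate_angles, key=lambda a: alignment_score(*read_split_track(a)))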
(Take note of the pictures of the machine. If anyone finds one sitting around at a flea market or in a forgotten pile of old junk, please rescue it. Nothing like this will ever be manufactured again. Even if the condition is "it looks like someone went after it with a big hammer as part of their anger management process," the bits that remain still have significant value and are easy to sell.)
I'm glad you called out the Dragon. Besides being an impressive piece of engineering, it's a beautiful piece of art. One of the most striking pieces of consumer electronics I've ever seen.
I feel like "pain" is a strong word here. It was a book spoiler. I wasn't laughing at people being punched or hurt or anything.
I acknowledge it's a dick move, but it really is just a spoiler for a book, not exactly life ruining and really shouldn't even be day-ruining. I had the book spoiled for me too and it was just something I moved on from, somehow.
> not exactly life ruining and really shouldn't even be day-ruining.
These were people who lined up for hours outside of a bookstore still waiting to get in after midnight. Many of those people were young and the lack of perspective and experience at young ages often results in assigning a disproportionate weight to the emotional events they experience. An event like that might not have been day-ruining for you, but I have no doubt that there were people who were genuinely hurt by it.
Maybe you forget the absolute hysteria around these books. People were passionate about learning what happened next, and incredibly excited for the reveal to happen organically.
This was done because it was the easiest way to massively distribute pain to people about a known weak spot. It was mean spirited, anti-social, and honestly indefensible.
To be clear, I'm not sitting in judgment of you or any of the other spoiler trolls, not back then and certainly not now. This is an instance where the Potter-philes couldn't fight back, and to my mind that's inevitably going to bring out the worst in human nature.
Elsethread, you mention that "some people really got upset". In some sense, the more upset they get, the more successful the troll and the funnier it gets, right? At least, it feels funny to me, even as it also feels bad to imagine upset kids, and even as I feel that upset kids learning that other humans are cruel is a necessary part of growing up.
You're not equipped to know what the right word is for anyone but yourself. Better to go through life curious about the vast diversity of mindsets than to assume they're homogeneous.
I think it's hard to argue against the idea that kids tend to be emotionally immature, and especially vicious in this regard. But considering that the GP has admitted that, in retrospect, they find this action to have been a dick move, I think it's important not to generalize immature behavior to all of humanity.
The question of whether humans are more biased towards social or antisocial behavior[1] is a complex one that philosophy has struggled with for a long time without a clear consensus.
1. Often historically framed as whether humans are inherently good or evil.
There's never going to be philosophical consensus on the "good/evil/social/antisocial" debate because the human impulse to self-justify and believe that you're the "good guy" is extremely powerful. Those of us who seek to understand human nature have to proceed without consensus as a goal.
Mao Zedong convinced kids and teenagers to have their parents and teachers killed during the Cultural Revolution by persuading them that it was prosocial behavior, and indeed their duty. So the question is fraught with conundrums of the form "humans tend toward prosocial/antisocial behavior according to which standard?"
I'd rather remind people that only a very small number of people are dicks with any desire or interest in inflicting pain on their fellow humans, even when there are no consequences.
Assholes do exist, and you should be aware of them, but assholes are a tiny minority of the population. There are far more people who aren't assholes, and an even greater number of people who are just doing their own thing and can't be bothered to go through the effort to hurt others just for kicks.
Pricing mistakes which make the supermarket money are unfortunate but low priority. Pricing mistakes which cost the supermarket money must be fixed immediately.