there are many causes, but it’s a drift in performance
you can drift a tool via the harness in many ways
you can modify the system prompt
you can modify the underlying model powering the harness
you can use different “thinking” levels for different processes in the harness
you can change the entire way a system works via the harness, which could be better or worse, depending on many things
you can introduce anti-anti-slop within the harness to foil users' attempts to work around it with patch scripts
you can modify how your tool sends requests to your server depending on many variables
you can handle requests differently, depending on any variable of your choosing, at the server level
you can modify the compute allotment per user from the backend, depending on many things, without telling the user. it's very easy. you can modify it dynamically depending on your own usage, the user's cycle, or their organization's priority level as a customer. The weekly and daily usage management system is intricate; compute is very finite and must be managed
the user has literally no way to know, and you have no legal obligation to tell them; you never made them any legally binding promises
the combination of so many factors that all affect each other means that you can, if you wanted to, create a new clusterfuck of an experience anytime any of these (or unknown) variables change. it may not even be deliberate: the complexity grows exponentially, so you may not even be able to promise a specific standard to your users
drift is not imagined, sure, but admitting to it could expose you to unneeded liability
That's a lot of words without actually defining the term, although idle_zealot's suggestion of "change" seems to make grammatical sense as a replacement here.
Hmm... I've had some customers who were gamblers. It's kind of sad to see. These are middle-aged dads of various economic classes who are desperately chasing a high when they should be focused on their families. To me, gambling and porn are yet more strains on the most important social institution: the family. It's fun, but it's bad for society, for those who care about that.
what's wrong with admitting you don't know something for a fact? i would love to see some proof for Mythos, a white paper or something
smaller companies, even startups, are held to much much higher standards
is Anthropic somehow immune? what have they done to earn that immunity? what goodwill, good stewardship, or good faith have they shown the developer community in the past few quarters?
The developer community is wildly fickle. They turn on you at the drop of a hat if you don't puritanically adhere to what they want. The question isn't "what have they done for the developer community" (no one working at a real company gives a shit), the question is "are they lying about Mythos".
I don't see why Mozilla would write that blogpost if they were. Is Mozilla lying too now?
You don't know what Mozilla got access to. They may just be covering their own asses.
My hunch is that it's a marketing ploy. I don't trust a company that says it can protect others when it lets its own tools leak. That seems logical to me; am I wrong?
I don't understand why you're stuck on the word "lie".
These are both true statements:
- We've just developed our new top model for agentic coding
- We've just developed a model capable of finding cybersecurity vulnerabilities at a scale never before seen
The thing is: when you say the first statement, you're saying something that everyone says. OpenAI said something similar for 5.5 just this morning. Once you loudly frame your release in the latter terms, you're not lying... but you're being very intentional about grabbing headlines.
Every top release from a frontier lab now enables the same thing. That's why we've already had response-level filters on cybersecurity for months now from both OpenAI and Anthropic.
Technically, every time either has released a top model in the last several months, they've been "enabling automated cybersecurity penetration at a scale never before seen." It was Anthropic that decided to quadruple down on the language and create a ton of buzz.
But OpenAI today showed that the existing cybersecurity mitigations already addressed the concern of misuse. Anthropic has the same (or even stricter) detection for widescale automated attacks and could have used it to ship Mythos if not for the marketing points.
The request definitely comes from the leagues' broadcast partners, right? They would want as many eyeballs concentrated in as few places as possible so they can sell ads for more.
The alternative is probably also true. If your F500 competitor is also handicapped by AI somehow, then you're all stagnant, maybe at different levels. Meanwhile, Anthropic is scooping up the software engineers it supposedly made irrelevant with Mythos and moving into literally 2+ new categories per quarter.
These people have always existed. Hell, they are here, too. Now they have a new thing to delegate responsibility to.
And no, I don't understand them at all. Taking responsibility for something, improving it, and stewarding it into production is a fantastic feeling, and much better than reading the comment section. :)
Saudi will host the biggest data centers in the world