Hacker News | mrbungie's comments

They don't have the know-how (except by proxy via OpenAI) or the custom hardware, and somehow they're even worse at integrating AI into their products than Google.

They don’t need to. Just like Amazon, they're seeing record Azure revenue from their third-party LLM hosting platform, gated only by the fact that no one can get enough chips right now.

Was this "paper" eventually peer reviewed?

PS: I know it is interesting and I don't doubt Anthropic, but I find it fascinating that they get such a pass on scientific rigor.


This is more of an article describing their methodology than a full paper. But yes, there are plenty of peer-reviewed papers on this topic: scaling sparse autoencoders to produce interpretable features for large models.

There have been a ton of peer-reviewed papers on SAEs in the past two years; some of them have been presented at conferences.

For example: "Sparse Autoencoders Find Highly Interpretable Features in Language Models" https://proceedings.iclr.cc/paper_files/paper/2024/file/1fa1...

"Scaling and evaluating sparse autoencoders" https://iclr.cc/virtual/2025/poster/28040

"Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning" https://proceedings.neurips.cc/paper_files/paper/2024/hash/c...

"Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" https://aclanthology.org/2024.blackboxnlp-1.19.pdf
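For anyone unfamiliar with the technique those papers study: an SAE learns an overcomplete feature dictionary by reconstructing model activations under an L1 sparsity penalty. Here is a toy numpy sketch of that objective (synthetic data, untied weights, plain gradient descent; real SAE training adds details like decoder-norm constraints that are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each sample is a sparse combination of 8 ground-truth directions.
d, m, n = 32, 64, 2048               # input dim, dictionary size, samples
truth = rng.normal(size=(8, d))
codes = rng.random((n, 8)) * (rng.random((n, 8)) < 0.2)  # ~20% active
X = codes @ truth

# Untied encoder/decoder parameters.
We = rng.normal(scale=0.1, size=(d, m)); be = np.zeros(m)
Wd = rng.normal(scale=0.1, size=(m, d)); bd = np.zeros(d)
lam, lr = 1e-3, 0.1                  # sparsity weight, step size

def loss_and_grads():
    H = np.maximum(X @ We + be, 0.0)          # sparse feature activations
    Xhat = H @ Wd + bd                        # reconstruction
    loss = np.mean((Xhat - X) ** 2) + lam * np.mean(np.abs(H))
    dXhat = 2 * (Xhat - X) / X.size           # grad of the MSE term
    dH = (dXhat @ Wd.T + lam * np.sign(H) / H.size) * (H > 0)
    return loss, X.T @ dH, dH.sum(0), H.T @ dXhat, dXhat.sum(0)

losses = []
for _ in range(500):
    loss, dWe, dbe, dWd, dbd = loss_and_grads()
    losses.append(loss)
    We -= lr * dWe; be -= lr * dbe
    Wd -= lr * dWd; bd -= lr * dbd

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The interpretability claim in the papers above is about what the learned rows of the dictionary turn out to represent in a real model's activations, which this toy example obviously can't show.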


Modern ML is old school mad science.

The lifeblood of the field is proof-of-concept pre-prints built on top of other proof-of-concept pre-prints.


Sounds like you agree this “evidence” lacks any semblance of scientific rigor?

(Not GP) There was a well-recognized reproducibility problem in the ML field before LLM mania, and that's considering published papers with proper peer review. The current state of affairs is in some ways even less rigorous than that, and then some people in the field feel free to overextend their conclusions into other fields like neuroscience.

Frankly, I don't see a reason to give a shit.

We're in the "mad science" regime because the current speed of progress means adding rigor would sacrifice velocity. Preprints are the lifeblood of the field because preprints can be put out there earlier and start contributing earlier.

Anthropic, much as you hate them, has some of the best mechanistic interpretability researchers and AI wranglers across the entire industry. When they find things, they find things. Your "not scientifically rigorous" is just a flimsy excuse to dismiss the findings that make you deeply uncomfortable.


> full featured f/oss alternatives.

Assuming this comes from lower barriers to entry for software engineering at scale thanks to LLMs, it still begs the question: who will pay for the tokens? Giving away your free time out of passion is one thing; giving away money is another.

Maybe we'll see a future where people crowdsource projects, supporting them directly via donations for tokens/LLM queries.


Tokens aren’t that expensive.

I built a CapRover clone that’s actually free software for <$1k. I imagine it wouldn’t be much more to modify a fork of Mattermost to add in their pay-gated features like SSO and message expiry etc.


> people crowdsource projects supporting them directly via donations for tokens/LLM queries.

Is this perhaps happening today? Large open source projects where an LLM could deliver the code. E.g. I want Home Assistant to connect to something that isn't mainstream but is used by a dozen users. Those dozen users fund the PR via a token budget?


Do you not value your time? Paying 100 bucks for a Claude Max subscription is well worth it.


Opportunity cost: would you rather pay 100 bucks to make more money, or for your FOSS projects?

The same can be said of your time, but here we're talking about scale benefits due to LLMs (i.e. lots of SaaS products dying due to lots of "full featured f/oss projects").


> What's Anthropic's optimization target??? Getting you the right answer as fast as possible!

Are you totally sure they are not measuring/optimizing engagement metrics? Because I'd bet OpenAI is doing that with every product they offer.


If you are really good and fast at validating/fixing code output, or you're actually not validating it beyond making sure it runs (no judgment), I can see it paying off 95% of the time.

But from what I've seen validating both my own and others' coding-agent outputs, I'd estimate a much lower percentage (Data Engineering/Science work). And, oh boy, some colleagues are hooked on generating no matter the quality. Workslop is a very real phenomenon.


This matches my experience using LLMs for science. Out of curiosity, I downloaded a randomized study and the CONSORT checklist, and asked Claude Code to do a review using the checklist.

I was really impressed with how it parsed the structured checklist. I was not at all impressed by how it digested the paper. Lots of disguised errors.


try codex 5.3. it's dry and very obviously AI; if you allow a bit of anthropomorphisation, it's kind of high-functioning autistic. it isn't an oracle, it'll still be wrong, but it's a powerful tool, completely different from claude.


Does it get numbers right? One of the mistakes it made in reading the paper was swapping sets of numbers from the primary/secondary outcomes.


it does get screenshots right for me, but obviously I haven't tried it on your specific paper. I can only recommend trying it out; it also has much more generous limits in the $20 tier than opus.


I see. To clarify, it parsed the numbers in the pdf correctly, but assigned them the wrong meaning. I was wondering if codex is better at interpreting non-text data.


Every time someone suggests Codex I give it a shot. And every time it disappoints.

After I read your comment, I gave Codex 5.3 the task of setting up an E2E testing skeleton for one of my repos, using Playwright. It worked for probably 45 minutes and in the end failed miserably: out of the five smoke tests it created, only two of them passed. It gave up on the other three and said they will need “further investigation”.

I then stashed all of that code and gave the exact same task to Opus 4.5 (not even 4.6), with the same prompt. After 15 mins it was done. Then I popped Codex’s code from the stash and asked Opus to look at it to see why three of the five tests Codex wrote didn’t pass. It looked at them and found four critical issues that Codex had missed. For example, Codex had failed to detect that my localhost uses https, so the E2E suite’s API calls from the Vue app kept failing. Opus also found that the two passing tests were actually invalid: they checked for the existence of a div with #app and simply assumed it meant the Vue app had booted successfully.

This is probably the dozenth comparison I’ve done between Codex and Opus. I think there was only one scenario where Codex performed equally well. Opus is just a much better model in my experience.


moral of the story is use both (or more) and pick the one that works - or even merge the best ideas from generated solutions. independent agentic harnesses support multi-model workflows.


I don't think that's the moral of the story at all. It's already challenging enough to review the output from one model. Having to review two, and then comparing and contrasting them, would more than double the cognitive load. It would also cost more.

I think it's much more preferable to pick the most reliable one and use it as the primary model, and think of others as fallbacks for situations where it struggles.


you should always benchmark your use cases and you obviously don't review multiple outputs; you only review the consensus.

see how perplexity does it: https://www.perplexity.ai/hub/blog/introducing-model-council
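For context, Perplexity's Model Council sends one prompt to several models and synthesizes a consensus answer. The sketch below is only a toy illustration of the voting idea (hypothetical answers, simple majority), not their actual pipeline:

```python
from collections import Counter

def consensus(candidates: list[str]) -> str:
    """Return the answer most candidates agree on (ties: first seen wins)."""
    return Counter(candidates).most_common(1)[0][0]

# Hypothetical outputs from three models asked the same question:
answers = ["42", "42", "41"]
print(consensus(answers))  # -> "42"
```

In practice free-form LLM outputs rarely match exactly as strings, so real systems compare answers semantically, often by using yet another model as the judge.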


Yet even Anthropic has shown the downsides of using them. I don't think it's a given that improvements in model scores and capabilities, plus being able to churn out code as fast as we can, will lead us to a singularity; we'll need more than that.


My late grandma taught herself how to use an iPad in her 70s and 80s without any issues, mostly motivated by her wish to read her magazines, doomscroll Facebook and play solitaire. Her last job was as a bakery cashier in her 30s, and she didn't learn how to use a computer in between, so there was no skill transfer going on.

Humans and their intelligence are actually incredible and will probably continue to be so; I don't really care what tech/"thought" leaders want us to think.


Pretty edgy response. I'd say trying to scale on price rather than on quantity is a bad business strategy for tech, period, especially if you hope to become Google-sized like OpenAI and company want to.


Why would you need a GPU for an AI-managed instance? I guess it would be useful for some workloads, but arguably not for most.


Well, this is a good example of "Shareholder value != customer value".


Also, shareholder value != positive effect on society.


You know what? I bet if you got rid of stock buybacks, there'd be more consequences for making a shit product.

