Hacker News | deepsquirrelnet's comments

My tinfoil hat theory, which may not be that crazy, is that providers are sandbagging their models in the days leading up to a new release, so that the next model "feels" like a bigger improvement than it is.

An important aspect of AI is that it needs to be seen as moving forward all the time. Plateaus are the death of the hype cycle, and would tether people's expectations closer to reality.


Possibly due to moving compute from inference to training

My purely unfounded, gut reaction to Opus 4.7 being released today was "Oh, that explains the recent 4.6 performance - they were spinning up inference on 4.7."

Of course, I have no information on how they manage the deployment of their models across their infra.


I was there too, but honestly, after today, 4.7 "feels" just as bad. I was cynical, but also kind of eager for the improvement. It's just not there. Compared to early Feb, I have to babysit EVERYTHING.

> Cuyahoga Valley: There is nothing wrong with Cuyahoga Valley. Statistically, you’re from Ohio, so why not?

In college, I took an interim elective course on the geology of the national parks. On the first day of class, the professor opened with an icebreaker: each student would say which national park they lived closest to. I said Ohio - Cuyahoga Valley.

Well, some snot-nosed Boy Scout confidently piped up that there were most certainly no national parks in Ohio, and the professor agreed. This is a deep personal grudge that I still hold to this day.


Dry-nosed Eagle Scout here to relieve you of your grudge. There is, of course, as you know, a national park in Ohio, and it is wonderful. I grew up right along its edge, and I'm forever grateful for it!

Ah, Baden-Powell's Principle proves itself true yet again: How can you tell if somebody was an Eagle Scout? Don't worry, he'll tell you immediately.

CnakeCharmer - https://github.com/dleemiller/CnakeCharmer

https://huggingface.co/datasets/CnakeCharmer/CnakeCharmer

This project started from a belief that LLMs should be better at Python-to-Cython code translation than they are. So we started assembling a large set of parallel implementations.

Then I realized that Claude Code was much better at working on the data when it could use tools (MCP) to check and iterate. The scope transformed into a platform for creating an SFT agentic trace dataset using sandboxed tools for compilation, testing, linting, address sanitizing, and benchmarking.

We still need to bulk up the GRPO dataset with a large number of good unmatched Python examples. But early results using SFT alone on gpt-oss 20b are quite good.
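To make the task concrete, here is a toy illustration of the kind of parallel pair we're collecting (this example is invented for this comment, not pulled from the dataset):

    # Pure Python reference implementation
    def dot(a, b):
        total = 0.0
        for x, y in zip(a, b):
            total += x * y
        return total

    # Cython translation (.pyx): typed memoryviews and C-typed locals
    # let the compiler emit a tight C loop instead of interpreter calls
    def dot(double[:] a, double[:] b):
        cdef double total = 0.0
        cdef Py_ssize_t i
        for i in range(a.shape[0]):
            total += a[i] * b[i]
        return total

The sandboxed tooling then compiles, tests, and benchmarks the translation against the Python reference.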


There are so many reasons, if you look at how it's being sold:

* We need to completely deregulate these US companies so China doesn't win and take us over

* We need to heavily regulate anybody who is not following the rules that make us the de-facto winner

* This is so powerful it will take all the jobs (and therefore if you lead a company that isn't using AI, you will soon be obsolete)

* If you don't use AI, you will not be able to function in a future job

* We need to line up an excuse to call our friends in government and turn off the open source spigot when the time is right

They have chosen fear as a motivator, and it is clearly working very well. It's easier to use fear now, while the technology is new, and then flip the narrative once people are more familiar with it, than to go the other direction. Companies are not just telling a story to hype their product, but also a story about why they alone should be entrusted to build it.


I am working on a large-scale dataset for producing agent traces for Python <> Cython conversion with tooling, and it is second only to Gemini Pro 3.1 in acceptance rates (16% vs 26%).

Mid-sized models like gpt-oss, minimax, and qwen3.5 122b are around 6%, and gemma4 31b is around 7% (but much slower).

I haven't tried Opus or ChatGPT due to the high costs on OpenRouter for this application.


Good article, and I think the "evolution of every AI system" is spot on.

In my opinion, the reason people don't use DSPy is that DSPy aims to be a machine learning platform. And, like the article says, this feels different or hard to people who aren't used to engineering with probabilistic outputs. But these days, many more people are programming with probability machines than ever before.

The absolute biggest time sink and 'here be dragons' of using LLMs is poke-and-hope prompt "engineering" without proper evaluation metrics.
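To make that concrete, here is a minimal sketch of the metric-driven alternative, using DSPy's signature/metric pattern (the task, metric, and model name here are invented for illustration):

    import dspy
    from dspy.evaluate import Evaluate

    # Assumes an LM has been configured, e.g.:
    # dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    class Summarize(dspy.Signature):
        """Summarize the passage in one sentence."""
        passage = dspy.InputField()
        summary = dspy.OutputField()

    program = dspy.Predict(Summarize)

    # The metric is the point: an executable definition of "good",
    # instead of eyeballing outputs after every prompt tweak.
    def concise(example, pred, trace=None):
        return len(pred.summary.split()) <= 30

    devset = [
        dspy.Example(
            passage="Cython compiles type-annotated Python to C for speed."
        ).with_inputs("passage"),
    ]

    score = Evaluate(devset=devset, metric=concise)(program)

Once a metric like that exists, every prompt or model change produces a number you can compare, and DSPy's optimizers can even tune against it automatically.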

> You don’t have to use DSPy. But you should build like someone who understands why it exists.

And this is the salient point; I think it's very well stated. It's not about the framework per se, but about the methodology.


Yeah, this is the main point I wanted to get across! I rarely recommend that people use DSPy, but DSPy is often so polarizing that people "throw out the baby with the bathwater": they decide not to use DSPy, but also don't learn from the great ideas it has!


This is even after the Hindenburg Research report that found numerous screaming red flags a few years ago.

https://hindenburgresearch.com/smci/


I worked at Micron in the SSD division when Optane (originally called 3D XPoint, pronounced "crosspoint") was being made. In my mind, there was never a real, serious push to productize it. But it's not clear to me whether that was due to unattractive terms of the joint venture or a lack of clear product fit.

There was certainly a time when it seemed they were shopping for engineers' opinions on what to do with it, but I think they quickly determined it would be a much smaller market than SSDs anyway, and they didn't end up pushing on it too hard. I could be wrong, though; it's a big company, and my corner was manufacturing, not product development.


I worked at Intel for a while and might be able to explain this.

There were/are often projects that come down from management that nobody thinks are worth pursuing. When I say nobody, it might not just be engineers but even, say, 1 or 2 people in management who just do a shit rollout. There are a lot of layers to Intel, and if even one layer in the Intel sandwich drags its feet, it can kill an entire project. I saw it happen a few times in my time there. That one specific node that Intel dropped the ball on kind of came back to 2-3 people in one specific department, as an example.

Optane was a minute before I got there, but having been excited about it at the time and having somewhat followed it, that's the vibe I get from Optane. It had a lot of potential, but someone screwed it up and that killed the momentum.


Are you referring to the Intel 10nm struggles in your reference to 2-3 people?


This is actually insane. Do you mean 2-4 people in one department basically killed Intel? Roll to disbelieve.


Yes, this is pretty common in large, enterprise-y tech companies that are successful. There is usually a small group of vocal members who have a strong conviction and the drive to make a vision a reality. This is contrary to the popular belief that large companies design by committee.

Of course, it works exceptionally well when the instinct turns out to be right. But it can end companies when it isn't.


It's somewhat plausible that a small group of people in one department was responsible for the bad bets that made their 10nm process a failure. But it was very much a group effort for Intel to escalate that problem into a prolonged disaster. Management should have stopped believing the undeliverable promises coming out of their fab side after a year or two, and should have started designing chips targeting fab processes that actually worked much sooner.


A friend was working at Micron on a rackmount network server with a lot of flash memory; I didn't ask at the time what kind of flash it used. The project was cancelled when it was nearly finished.


Interesting read! I love to see this spirit. I grew up with a different, but similar, experience. As an '80s and '90s kid, though, computers were nothing but limitations. Even when my dad built a machine with a 133 MHz Cyrix chip, just a year later it was outdated by computers with literally double the computing power.

That Cyrix machine was already miles ahead of the 386 that was handed down to me to play text-based games on and learn DOS through hard knocks. I remember leafing through old hard drives that had 10 MB of capacity and realizing they had no value despite not being that old.

Later, in college, I had the confidence to build my first desktop from parts cobbled together from sketchy resellers: an Athlon A1, single-core, 1 GHz. Man, that thing could fly.


> I tapped into Pangram. Pangram is a remarkably good, conservative model for detecting LLM-generated text. These detectors have a bad rep among techies, but the objections are often based on outdated assumptions

The Turing test is really in the rearview, huh?

Humans need machines to detect if a machine wrote the text, because humans aren’t sure.


In the Turing test, you converse with the other side instead of just reading static text. This makes a huge difference.

