
I feel like we’ve been hearing this for 4 years now. The improvements to programming (IME) haven’t come from improved models, they’ve come from agents, tooling, and environment integrations.

> I feel like we’ve been hearing this for 4 years now.

I feel we were hearing very similar claims 40 years ago, about how the next version of "Fourth Generation Languages" was going to enable business people and managers to write their own software without needing pesky programmers to do it for them. They'll "just" need to learn how to specify the problem sufficiently well.

(Where "just" is used in its "I don't understand the problem well enough to know how complicated or difficult what I'm about to say next is" sense. "Just stop buying cigarettes, smoker!", "Just eat less and exercise more, fat person!", "Just get a better paying job, poor person!", "Just cheer up, depressed person!")


> The improvements to programming (IME) haven’t come from improved models, they’ve come from agents, tooling, and environment integrations.

I disagree. This is almost entirely down to model capability increases. I've stated this elsewhere: https://news.ycombinator.com/item?id=46362342

Improved tooling/agent scaffolds, whatever, are symptoms of improved model capabilities, not the cause of better capabilities. Put a 2023-era model such as GPT-4, or even a 2024-era model such as Sonnet 3.5, into today's tooling and it would crash and burn.

The scaffolding and tooling for these models have been tried in different forms and prototypes ever since GPT-3 came out in 2020. The only reason they're taking off in 2025 is that models are finally capable enough to use them.


Yet when you compare the same model in 2 different agents you can easily see capability differences. But comparing (same-tier) models in the same agent, the difference is much less stark.

My personal opinion is that there was a threshold earlier this year where the models got basically competent enough to be used for serious programming work. But all the major on-the-ground improvements since then have come from the agents - and not all agents are equal, while all SOTA models effectively are.


> Yet when you compare the same model in 2 different agents you can easily see capability differences.

Yes, definitely. But this is to be expected. Heck, take the same person and put them in two different environments and they'll have very different performance!

> But cross (same tier) model in the same agent is much less stark.

Unclear what you mean by this. I do agree that the big three companies (OpenAI, Anthropic, Google DeepMind) are all more or less neck and neck in SOTA models, but every new generation has been a leap. They just keep leaping over each other.

If you compare e.g. Opus 4.1 and Opus 4.5 in the same agent harness, Opus 4.5 is way better. If you compare Gemini 3 Pro and Gemini 2.5 Pro in the same agent harness, Gemini 3 is way better. I don't do much coding or benchmarking with OpenAI's family of models, but anecdotally have heard the same thing going from GPT-5 to GPT-5.2.

The on-the-ground improvements have been coming primarily from model improvements, not harness improvements (the latter is unlocked by the former). Again, it's not that there were breakthroughs in agent frameworks; the ideas we're seeing now have all been tried before. Models simply weren't capable enough to actually use them. It's just that more and more (pre-tried!) frameworks are starting to make sense now. Indeed, there are certain frameworks and workflows that simply did not make sense with Q2-Q3 2025 models but do make sense with Q4 2025 models.


I actually have spent a lot of time comparing the 4.1 and 4.5 Claude models (and lately the 5.1 → 5.2 ChatGPT models), and for many, many tasks there is no significant improvement.

All things being equal I agree that the models are improving, but for many of the tasks I'm testing, what has improved the most is the agent. The agent choosing the appropriate model for the task, for instance, has been huge.

I do believe there is beneficial symbiosis, but in my results the agents provide much bigger variance than the models.


Both are true: models have also improved significantly in the last year alone, never mind compared to 4 years ago. Agents, tooling, and other sugar on top are just that - they enable more efficient and creative usage - but let's not downplay how much better models today are compared to what was available in the past.

How do you judge model improvements vs tooling improvements?

Unless you're working at one of the big players or running your own models, it appears that even the APIs these days are wrapped in layers of tooling, abstracting raw model access more than ever.


> even the APIs these days are wrapped in layers of tooling and abstracting raw model access more than ever.

No, the APIs for these models haven't really changed all that much since 2023. The de facto standard for the field is still the chat completions API that was released in early 2023. It is almost entirely model improvements, not tooling improvements that are driving things forward. Tooling improvements are basically entirely dependent on model improvements (if you were to stick GPT-4, Sonnet 3.5, or any other pre-2025 model in today's tooling, things would suck horribly).
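
For reference, here's roughly what that de facto standard still looks like - a minimal sketch of a chat completions call (the model id and prompt are placeholders, error handling omitted):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    func main() {
        // The 2023-era request shape: a model id plus a list of
        // role/content messages. Model name here is illustrative.
        body, _ := json.Marshal(map[string]any{
            "model": "gpt-4",
            "messages": []map[string]string{
                {"role": "user", "content": "Say hello."},
            },
        })
        req, _ := http.NewRequest("POST",
            "https://api.openai.com/v1/chat/completions",
            bytes.NewReader(body))
        req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
        req.Header.Set("Content-Type", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out map[string]any
        json.NewDecoder(resp.Body).Decode(&out)
        fmt.Println(out) // the reply lives under choices[0].message.content
    }

Tool/function calling got bolted onto that same envelope in mid-2023; the envelope itself is basically unchanged.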


The code that's generated when given a long leash is still crap. But damned if I didn't use a JIRA MCP and a GitLab MCP, and have the corporate AI just "do" a couple of well defined and well scoped tickets, including interacting with JIRA to get the ticket contents, update its progress, push to GitLab, and open an MR. Then the corporate CodeRabbit does a first pass code review against the code so any glaring errors are stomped out before a human reviews it.

What's more scary though is that the JIRA tickets were created from a design doc that was half AI generated in the first place. The human proposed something, the AI asked clarifying questions, then broke the project down into milestones and then tickets, and then created the epic and issues on JIRA.

One of my tradie friends taking an HVAC class tells me that there are a couple of programmers in his class looking to switch careers. I don't know what the future brings, but those programmers (sorry, "software developers") may have the right idea.

Yes we get it, there is a ton of "work" being done in corporate environments, in which the slop that generative AI churns out is similar to the slop that humans churn out. Congrats.

Using grep or regex is textual refactoring. If you want to rename every reference to a type Foo, how do you do that without touching any variables named foo, or any classes named FooBar?

The answer is to use tools that have semantic info to rename things.
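
A toy illustration of why (made-up snippet, using Go's regexp for the demo):

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    func main() {
        src := "type Foo struct{}\n" +
            "func NewFoo() Foo { return Foo{} }\n" +
            "type FooBar struct{ foo Foo }\n" +
            "// Foo is the entry point\n"

        // Naive replacement mangles FooBar into BazBar too:
        fmt.Println(strings.ReplaceAll(src, "Foo", "Baz"))

        // Word boundaries spare FooBar and the lowercase foo field, but a
        // regex still can't tell *this* type Foo from any other token
        // spelled Foo (comments, strings, other packages/scopes):
        re := regexp.MustCompile(`\bFoo\b`)
        fmt.Println(re.ReplaceAllString(src, "Baz"))
    }

A semantic rename (an IDE refactor, or gopls rename in Go's case) resolves the symbol first, so it only touches actual uses of that type.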


I often want them to rename all the textual references too, because otherwise you have a bunch of variables using the old name as a reference.

Even though it has no semantic significance to the compiler, it does for all the human beings who will read it and get confused.


The parent is comparing it to, e.g., JetBrains' git integration.

You get a prompt on the terminal. I've never had a cashier suggest anything to me, and I don't really want their input. The correct answer is always to pay in local currency and let your bank handle it.

I once came across a cashier that thought you had to select the foreign currency option. When I tried to pay in the local currency she cancelled the transaction.

Needed to get another member of staff to explain to her that the local currency option would work fine.


I’m not defending this behaviour with Ryanair, but this is not unique to them at all. It’s an industry “standard”. I’m Irish but live in the UK - when we make card transactions it asks what currency we want to pay in, and hides the exchange rate spread.

> I will only use them if I have literally no other choice

Even with the £20 increase they were likely cheaper than the alternative, if it exists. If this is going to push you into not using them, basically every other airline will be ruled out for you. EasyJet are exactly the same. BA/KLM/Air France/Aer Lingus are all the same on their short hop flights (I’ve actually never flown Lufthansa so I can’t comment on them). The short haul European routes are a race to the bottom.


To be clear, the currency scam was a last straw, not the major dark pattern.

When you compare list prices for flights with them versus almost any other airline you are comparing apples with oranges. The only way to figure out exactly what you'll pay is to go through the entirety of their checkout procedure. My experiences with those other airlines for short haul flights are quite different.


I also hate that it continues through the whole flight. I don't want to find out I have to pay to have my boarding pass printed, or that I need to pay for a glass of water on the plane. The other carrier might be more, but the things that come in the bundled fare make the trip easier, with fewer friction points.

> Even with the £20 increase they were likely cheaper than the alternative, if it exists.

Honestly, on many routes, I think this is true far less often than it used to be.


Aurora serverless requires provisioned compute - it’s about $40/mo last time I checked.
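
Back-of-envelope for where that figure comes from (assuming Serverless v2's 0.5 ACU floor and roughly $0.12/ACU-hour - both numbers are assumptions worth re-checking against current regional pricing):

    package main

    import "fmt"

    func main() {
        const minACU = 0.5         // assumed minimum capacity floor
        const usdPerACUHour = 0.12 // assumed ballpark us-east-1-ish rate
        const hoursPerMonth = 730.0

        fmt.Printf("$%.0f/mo\n", minACU*usdPerACUHour*hoursPerMonth) // ≈ $44/mo
    }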

The performance disparity is just insane.

Right now from Hetzner you can get a dedicated server with a 6c/12t Ryzen 5 3600, 64GB RAM and 2×512GB NVMe SSD for €37/mo

Even if you just served files from disk, with no RAM caching, that could give 200k small files per second.

From RAM, and with 6 dedicated cores, network will saturate long before you hit compute limits on any reasonably efficient web framework.
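
Rough numbers for the "network saturates first" claim (a 1 Gbit/s uplink and ~4 KB per response are assumed round figures):

    package main

    import "fmt"

    func main() {
        linkBytesPerSec := 1e9 / 8.0 // 1 Gbit/s uplink ≈ 125 MB/s (assumed)
        respBytes := 4.0 * 1024      // small static file plus headers (assumed)

        // Responses per second before the NIC is full - well under the
        // ~200k/s the disks could serve, so bandwidth is the ceiling.
        fmt.Printf("%.0f responses/s\n", linkBytesPerSec/respBytes) // ≈ 30,518
    }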


Does it?

I use AI as a smart autocomplete - I've tried multiple tools on multiple models and I still _regularly_ have it dump absolute nonsense into my editor. In the best case it's gone on a tangent, but in the most common case it's assumed something (oftentimes directly contradicting what I've asked it to do), gone with it, and lost the plot along the way. Of course, when I correct it it says "you're right, X doesn't exist so we need to do X"…

Has it made me faster? Yes. Has it changed engineering? Not even close. There's absolutely no world where I would trust what I've seen out of these tools to run in the real world, even with supervision.


When you have that hair-raising "am I crazy, why are people touting AI" feeling, it's good to look at their profile. Oftentimes they're caught up in some AI play. Also, it's good to remember YC has heavy investment in gen AI, so this site is heavily biased.

Context is king, too: in greenfield startups where you care little about maintenance and can accept redundant front end frameworks and backend languages? I believe agent swarms can poop out a lot lot lot of code relatively quick… Copy and paste is faster though. Downloading a repo is very quick.

In startups I've competed against companies with 10x and 100x the resources and manpower on the same systems we were building. The amount of code they theoretically could push wasn't helping them; they were locked to the code they had actually shipped, and were in a downward hiring spiral because of it.


Here's the thing - an awful lot of it doesn't even compile/run, never mind do the right thing. My most recent example was asking it to use Terraform to run an Azure Container App with an environment variable in an existing app environment. It repeatedly made up where the environment block goes, and Cursor kept putting the actual resource in random places in the file.

There's a couple of providers that give you that kind of abstraction. PlayFab is _pretty close_ but it's fairly slow to ramp up and down. There is/was Multiplay - they've had some changes recently and I'm not sure what their situation is right now. There's also stuff like Hathora (they're great but expensive).

At a previous job, we used Azure Container Apps - it's what you _want_ Fargate to be. AIUI, Google Cloud Run is pretty much the same deal but I've no experience with it. I've considered deploying them as Lambdas in the past too, depending on session length…


Cloud Run tries to be this, but every service like this has quirks. For example, GCR doesn't let you deploy to high-CPU/MEM instances, has lower performance due to multi-tenant hosts, etc.

But that’s not what OP asked for. They asked for

> As a hobbyist part of me wants the VM abstracted completely (which may not be realistic). I want to say “here’s my game server process, it needs this much cpu/mem/network per unit, and I need 100 processes” and not really care about the underlying VM(s), at least until later. The closest thing I’ve found to this is AWS fargate.

You can’t have on demand usage with no noisy neighbours without managing the underlying VMs.

I used Hathora [0] at my previous job (they've expanded since and I'm not sure how much this applies anymore) - they had a CLI tool which took a Dockerfile and a folder, built a container, and let you run it anywhere globally after that. Their client SDK contained a "get lowest latency location" call that you could use on startup. It was super neat, but quite expensive!

[0] https://hathora.dev/gaming


Have you written any Go code? It's the closest I've come to actually enjoying a type system - it gets out of your way, and loosely enforces stuff. It could do with some more convenience methods, but overall I'd say it's my most _efficient_ type system (not necessarily the best).
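
For anyone who hasn't tried it, the "gets out of your way" part is mostly structural typing: interfaces are satisfied implicitly, so a type never declares what it implements. A minimal sketch with made-up names:

    package main

    import "fmt"

    type Greeter interface {
        Greet() string
    }

    // English never mentions Greeter; having the right method set is enough.
    type English struct{}

    func (English) Greet() string { return "hello" }

    func greetAll(gs []Greeter) {
        for _, g := range gs {
            fmt.Println(g.Greet())
        }
    }

    func main() {
        greetAll([]Greeter{English{}})
    }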

I was underwhelmed by uv as a tool when it was announced, and when I started using it. For context, I'm a C++ developer who occasionally has to dip into Python-land for scripts and tooling. I set up a new workstation about 6 months ago and decided I'd just use pip + venv again, and honestly I lasted 2 weeks before installing uv again. It's one of those tools that... doesn't really do much except _what you wanted the original tool to do_, and I'm hoping that ty has the same effect.
