For seasoned maintainers of open source repos, there is explicit evidence that these tools slow them down, even when they think they sped them up: https://arxiv.org/abs/2507.09089
Cue: "the tools are so much better now", "the people in the study didn't know how to use Cursor", etc. Regardless if one takes issue with this study, there are enough others of its kind to suggest skepticism regarding how much these tools really create speed benefits when employed at scale. The maintenance cliff is always nigh...
There are definitely ways in which LLMs, and the agentic coding tools scaffolded on top of them, help with aspects of development. But saying that anyone who claims otherwise is either being disingenuous or doesn't know what they are doing is not an informed take.
I have seen this study cited enough to have a copy paste for it. And no, there are not a bunch of other studies that have any sort of conclusive evidence to support this claim either. I have looked and would welcome any with good analysis.
"""
1. The sample is extremely narrow (16 elite open-source maintainers doing ~2-hour issues on large repos they know intimately), so any measured slowdown applies only to that sliver of work, not “developers” or “software engineering” in general.
2. The treatment is really “Cursor + Claude, often in a different IDE than participants normally use, after light onboarding,” so the result could reflect tool/UX friction or unfamiliar workflows rather than an inherent slowdown from AI assistance itself.
3. The only primary outcome is self-reported time-to-completion; there is no direct measurement of code quality, scope of work, or long-term value, so a longer duration could just mean “more or better work done,” not lower productivity.
4. With 246 issues from 16 people and substantial modeling choices (e.g., regression adjustment using forecasted times, clustering decisions), the reported ~19% slowdown is statistically fragile and heavily model-dependent, making it weak evidence for a robust, general slowdown effect.
"""
Any developer (who was a developer before March 2023) that is actively using these tools and understands the nuances of how to search the vector space (prompt) is being sped up substantially.
I think we agree on the limitations of the study--I literally began my comment with "for seasoned maintainers of open source repos". I'm not sure if in your first statement ("there are no studies to back up this claim.. I welcome good analysis") you are referring to claims that support an AI speedup. If so, we agree that good analysis is needed. But if you think there already is good data:
Can you link any? All I've seen is stuff like Anthropic claiming 90% of internal code is written by Claude--I think we'd agree that we need an unbiased source and better metrics than "code written". My concern is that whenever AI usage by professional developers is studied empirically, as far as I have seen, the results never corroborate your claim: "Any developer (who was a developer before March 2023) that is actively using these tools and understands the nuances of how to search the vector space (prompt) is being sped up substantially."
I'm open to it being possible, but as someone who was a developer before March 2023 and is surrounded by many professionals who were as well, our results are more lukewarm than what I see boosters claim. It speeds up certain types of work, but not everything, and not in a manner that adds up to all work being "sped up substantially".
I need to see data, and all the data I've seen goes the other way. Did you see the recent Substack looking at public Github data showing no increase in the trend of PRs all the way up to August 2025? All the hard data I've seen is much, much more middling than what people who have something to sell AI-wise are claiming.
"major architectural decisions don't get documented anywhere"
"commit messages give no "why""
This is so far outside of common industry practices that I don't think your sentiment generalizes. Or perhaps your expectation of what should go in a single commit message is different from the rest of us...
LLMs, especially those with reasoning chains, are notoriously bad at explaining their thought process. This isn't vibes, it is empiricism: https://arxiv.org/abs/2305.04388
If you are genuinely working somewhere where the people around you are worse than LLMs at explaining and documenting their thought process, I would look elsewhere. Can't imagine that is good for one's own development (or sanity).
I've worked everywhere from small startups to megacorps. The megacorps certainly do better with things like initial design documents that startups often skip entirely, but even then they're often largely out-of-date because nobody updates them. I can guarantee you that I am talking about common industry practices in consumer-facing apps.
I'm not really interested in what some academic paper has to say -- I use LLMs daily and see first-hand the quality of the documentation and explanations they produce.
I don't think there's any question that, as a general rule, LLMs do a much better job documenting what they're doing, and making it easy for people to read their code, with copious comments explaining what the code is doing and why. Engineers, on the other hand, have lots of competing priorities -- even when they want to document more, the thing needs to be shipped yesterday.
Alright, I'm glad to hear you've had a successful and rich professional career. We definitely agree that engineers generally fail to document when they have competing priorities, and that LLMs can be of use to help offload some of that work successfully.
Your initial comment made it sound like you were commenting on a genuine apples-to-apples comparison between humans and LLMs in a controlled setting. That's the place for empiricism, and I think dismissing studies examining such situations is a mistake.
A good warning flag for why that is a mistake is the recent study that showed engineers estimated LLMs sped them up by around 24%, but when measured they were actually about 19% slower. One should always examine whether or not the specifics of a study really apply to them--there is no "end all be all" in empiricism--but when in doubt the scientific method is our primary tool for determining what is actually going on.
But we can just vibe it lol. Fwiw, the parent comment's claims line up more with my experience than yours. Leave an agent running for "hours" (as specified in the comment) coming up with architectural choices, ask it to document all of it, and then come back and see it is a massive mess. I have yet to have a colleague do that, without reaching out and saying "help I'm out of my depth".
The paper and example you talk about seem to be about agent or plan mode (in LLM IDEs like Cursor, as those modes are called) while I and the parent are talking about ask mode, which is where the confusion seems to lie. Asking the LLM about the overall structure of an existing codebase works very well.
OK yes, you are right that we might be talking about employing AI toolings in different modes, and that the paper I am referring to is absolutely about agentic tooling executing code changes on your behalf.
That said, the first comment of the person I replied to contained: "You can ask agents to identify and remove cruft", which is pretty explicitly speaking to agent mode. He was also responding to a comment that was talking about how humans spend "hours talking about architectural decisions", which as an action mapped to AI would be more plan mode than ask mode.
Overall I definitely agree that using LLM tools to just tell you things about the structure of a codebase are a great way to use them, and that they are generally better at those one-off tasks than things that involve substantial multi-step communications in the ways humans often do.
I appreciate being in the weeds here haha--hopefully we all got a little better at talking about the nuances of these things :)
Idealized industry practices that people wish to follow, but when it comes to meeting deadlines, I too have seen people eschew these practices for getting things out the door. It's a human problem, not one specific to any company.
Yes, I recognize that, for various reasons, people will fail to document even when it is a professional expectation.
I guess in this case we are comparing an idealized human to an idealized AI, given that AI equally has its own failings in non-idealized scenarios (like hallucination).
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that itself secret, but more likely they probably just have had humans generate an enormous number of examples, and then synthetically build on that.
No benchmark is safe, when this much money is on the line.
> When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?
> Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.
I'm sure each of the frontier labs has some secret methods, especially in how they train their models and in the engineering of optimized inference. That said, I don't think them saying they'd keep a big breakthrough secret would be evidence in this case of a "secret sauce" on ARC-AGI-2.
If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)
The amount of capital that rests on releases like these is insane. The incentive is just too high to not manipulate places like HN, which have a surprising amount of sway with tech industry sentiment.
Edit: Check out how my claim against astroturfing did, one of the first comments posted on the primary release blog: https://news.ycombinator.com/item?id=45967999#45968295. Could always be over-eager Google employees, or maybe the tech community really is this excited for the Gemini 3 release. Seems fishy to me, though...
People are likely downvoting you because allegations of astroturfing aren't allowed, you are supposed to just report what you suspect.
>Please don't post insinuations about astroturfing, shilling, brigading, foreign agents, and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email [email protected] and we'll look at the data.
That's a good point, although given I'd never seen this rule I question if it is commonly known enough that it is actually the reason I'm being downvoted.
Do you not think what has happened today is suspicious? The Gemini 3 posts are, to my eye, out of hand..
Perception seems to be one of the main constraints on LLMs where not much progress has been made. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.
I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.
AFAIK this is actually a separate mechanism, which is part of the visual cortex and not the retina. Essentially recognizing even a single object requires the complete attention of pretty much your entire brain in the moment of recognition.
What I am referring to is a much more basic form of shape recognition that goes on at the level of the neural networks in the retina.
I think in this case, tokenization and perception are somewhat analogous. I think it is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with. If you allow the analogy.
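To make the analogy a bit more concrete, here is a toy sketch of what a "retina-like" tokenizer could look like--everything below (the filters, the patch size, the token ids) is made up for illustration, not how any real vision model actually tokenizes:

    # Toy "retina-like" tokenizer: convolve an image with a few fixed
    # center-surround / oriented-edge filters (loosely analogous to retinal
    # ganglion cell responses), then map each patch to the id of its dominant
    # filter, giving a small discrete vocabulary of "shape tokens".
    import numpy as np

    def make_filters(size=5):
        ax = np.arange(size) - size // 2
        xx, yy = np.meshgrid(ax, ax)
        center_surround = np.exp(-(xx**2 + yy**2) / 2.0) - 0.5 * np.exp(-(xx**2 + yy**2) / 8.0)
        vertical_edge = np.sign(xx).astype(float)
        horizontal_edge = np.sign(yy).astype(float)
        return [center_surround, vertical_edge, horizontal_edge]

    def tokenize(image, patch=5):
        filters = make_filters(patch)
        h, w = image.shape
        tokens = []
        for i in range(0, h - patch + 1, patch):
            for j in range(0, w - patch + 1, patch):
                region = image[i:i + patch, j:j + patch]
                responses = [np.sum(region * f) for f in filters]
                tokens.append(int(np.argmax(np.abs(responses))))  # token id = dominant filter
        return tokens

    img = np.random.rand(20, 20)  # stand-in for a grayscale image
    print(tokenize(img))

The point is just that the discrete vocabulary comes from fixed, biology-inspired feature detectors rather than raw pixel chunks, which is the direction the retina comparison gestures at.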
Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think matching the brain abilities of even a bug might be very hard, but that does not mean that there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms or whatever the precise jargon is.
> How do brains do it with such a low amount of energy?
Physical analog chemical circuits whose physical structure directly is the network, and which use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.
There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.
This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet.
My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.
My armchair is equally comfy, and I have an actual paper to point to:
Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]
They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:
"The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."
So yeah, real neurons are a bit more complex than ReLU or Sigmoid.
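To give a flavor of the gap (purely a toy, not the Jaxley model--the parameter names and numbers below are invented for illustration):

    # A ReLU "neuron" versus a single-compartment, conductance-based update
    # with two hand-picked conductances. Illustrative only.
    import numpy as np

    def relu_neuron(inputs, weights):
        # The standard point-neuron abstraction: one dot product, one nonlinearity.
        return max(0.0, float(np.dot(inputs, weights)))

    def conductance_neuron(input_current, dt=0.1, steps=1000):
        # A membrane voltage evolving over time under a leak conductance and an
        # excitatory conductance; counts a "spike" whenever a threshold is crossed.
        v, v_rest, v_thresh, v_reset = -65.0, -65.0, -50.0, -65.0
        g_leak, g_exc, e_exc, c_m = 0.1, 0.05, 0.0, 1.0
        spikes = 0
        for _ in range(steps):
            i_leak = g_leak * (v_rest - v)
            i_exc = g_exc * (e_exc - v)
            v += dt * (i_leak + i_exc + input_current) / c_m
            if v >= v_thresh:
                spikes += 1
                v = v_reset
        return spikes

    print(relu_neuron(np.array([0.5, -0.2]), np.array([1.0, 2.0])))
    print(conductance_neuron(input_current=1.5))

Even this stripped-down version has to integrate a voltage through time under multiple conductances, while the ReLU version is a single dot product--and the paper's model layers nine channel types and calcium pump dynamics on top of that.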
My whole point is that it may be possible to do perception using a lot of computational power, or alternatively there could be some other kind of smart idea that allows doing it in a different way with much less computation. It is not clear it requires a lot.
There could definitely be a chance. I was just responding to what in your comment sounded like a question.
That said, I think there is good reason to be skeptical that it is a good chance. The consistent trend of finding higher complexity than expected in biological intelligences (like in C. elegans), combined with the fact that the physical nature of digital architectures versus biological architectures is very different, is a good reason to bet on it being really complex to emulate with our current computing systems.
Obviously there is a way to do it physically--biological systems are physical after all--but we just don't understand enough to have the grounds to say it is "likely" doable digitally. Stuff like the Universal Approximation Theorem implies that in theory it may be possible, but that doesn't say anything about whether it is feasible. Same thing with Turing completeness too. All that these theorems say is our digital hardware can emulate anything that is a step-by-step process (computation), but not how challenging it is to emulate it or even that it is realistic to do so. It could turn out that something like human mind emulation is possible but it would take longer than the age of the universe to do it. Far simpler problems turn out to have similar issues (like calculating the optimal Go move without heuristics).
This is all to say that there could be plenty of smart ideas out there that break our current understandings in all sorts of ways. Which way the cards will land isn't really predictable, so all we can do is point to things that suggest skepticism, in one direction or another.
Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers.
I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.
That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.
If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.
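Roughly the shape of probe I have in mind is below--a sketch only, where the complete() call and the "private" canary text are stand-ins, and real training-data extraction audits are much more involved than this:

    # Prompt the model with the first half of a document the provider claims it
    # never trained on, then measure how much of the second half comes back
    # verbatim. High overlap across many such documents would be a red flag.
    from difflib import SequenceMatcher

    def verbatim_overlap(expected: str, generated: str) -> float:
        # Fraction of the expected continuation covered by the longest common block.
        match = SequenceMatcher(None, expected, generated).find_longest_match(
            0, len(expected), 0, len(generated))
        return match.size / max(len(expected), 1)

    def probe(document: str, complete) -> float:
        midpoint = len(document) // 2
        prefix, continuation = document[:midpoint], document[midpoint:]
        return verbatim_overlap(continuation, complete(prefix))

    # Stand-in "model" that has clearly memorized the text, for demonstration:
    secret_email = "Dear Jane, your appointment with Dr. Smith is confirmed for Tuesday at 3pm."
    score = probe(secret_email, complete=lambda prefix: secret_email[len(prefix):])
    print(score)  # near 1.0 suggests memorization; near 0.0 is what you'd hope for

A single hit proves nothing, but done carefully and at scale (which is roughly what the published training-data extraction research does) it could produce the kind of evidence that gets the pitchforks out.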
Sure, but the extent to which you bend the truth to get those impressive numbers is absolutely gotcha-able.
Showing a new screen by default to everyone who is using your main product flow and then claiming that everyone who is seeing it is a priori a "user" is absurd. And that is the only way they can get to 2 billion a month, by my estimation.
They could put a new yellow rectangle at the top of all google search results and claim that the product launch has reached 2 billion monthly users and is one of the fastest-growing products of all time. Clearly absurd, and the same math as what they are saying here. I'm claiming my hottake gotcha :)
"Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."
Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.
Cue: "the tools are so much better now", "the people in the study didn't know how to use Cursor", etc. Regardless if one takes issue with this study, there are enough others of its kind to suggest skepticism regarding how much these tools really create speed benefits when employed at scale. The maintenance cliff is always nigh...
There are definitely ways in which LLMs, and agentic coding tools scaffolded in top, help with aspects of development. But to say anyone who claims otherwise is either being disingenuous or doesn't know what they are doing, is not an informed take.