
Organizations now generate 10x the amount of code, because everyone can do it.

But we have exactly the same number of reviewers. How the heck are we gonna deal with it when we cannot use LLMs for sanity checking LLM code?

Like literally yesterday I had a non-technical person who used Codex to build an optimization algorithm, and due to the momentum it gained I was asked to “fix the rough edges and help with scaling”.

The entire thing was trash (it was trying to do a naive search on a combinatorial problem with 1000s of integers, and was violating constraints with high probability, including integrality). I had to spend my whole day reviewing it and make a technical presentation to their leadership that it was just a polished turd.



> How the heck are we gonna deal with it when we cannot use LLMs for sanity checking LLM code?

Unit testing. LLMs are very good at writing tests and writing code that is testable (as long as you ask it), and if you just check that the tests are actually calling the code, covering all the obvious edge cases, and getting correct results, that's actually quite fast to review -- faster than reviewing the code.

And you can include things like performance testing in tests as well.

We're moving to a world where we work with definitions and tests and are less concerned with the precise details of how code is written within functions. Which is a big shift in mindset.
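
To make that concrete, here's roughly what that review step looks like (a sketch with made-up names: parse_duration() and its module stand in for whatever the LLM implemented). What I check is that the tests call the real function, hit the obvious edge cases, and assert concrete values rather than tautologies:

    # Hypothetical example: the LLM wrote parse_duration(); I mostly review the tests.
    import pytest

    from mypkg.durations import parse_duration  # assumed module under test

    def test_basic_units():
        assert parse_duration("90s") == 90
        assert parse_duration("2m") == 120
        assert parse_duration("1h30m") == 5400

    def test_zero_and_whitespace():
        assert parse_duration("0s") == 0
        assert parse_duration(" 15m ") == 900

    @pytest.mark.parametrize("bad", ["", "abc", "-5m", "10x"])
    def test_invalid_input_raises(bad):
        with pytest.raises(ValueError):
            parse_duration(bad)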


The only way this might work (IMO) is writing the tests yourself (which, of course, requires you to plan and design very meticulously in advance) and doing some kind of “blind TDD” where the LLM is not able to see the tests, only run them and act on the results. Even then, I’ve had Claude (Opus 4.1) bypass tests by hardcoding conditions as it found them, so I’d say reliability for this method is not 100%.

Having the LLM write the tests is… well, a recipe for destruction unless you babysit it and give it extremely specific restrictions (again, I’ve done this in mid- to large-sized projects with fairly comprehensive documentation on testing conventions, and results have been mixed: sometimes the LLM does an okay job but tests obvious things, sometimes it ignores the instructions, sometimes it hardcodes or disables conditions…)
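
One mitigation for the hardcoding trick is to keep a hand-written test with randomized inputs out of the agent's context, so there are no fixed expected values to hardcode. A sketch with made-up names (knapsack_value is the LLM-written code under test, brute_force_value a tiny oracle written by hand):

    # Hand-written and never shown to the agent: randomized inputs mean the model
    # can't hardcode expected outputs to make the suite go green.
    import random

    from solver import knapsack_value        # LLM-written code under test (assumed name)
    from reference import brute_force_value  # small, obviously-correct oracle I wrote

    def test_matches_brute_force_on_small_random_instances():
        rng = random.Random(0)
        for _ in range(200):
            n = rng.randint(1, 8)
            weights = [rng.randint(1, 20) for _ in range(n)]
            values = [rng.randint(1, 50) for _ in range(n)]
            capacity = rng.randint(5, 60)
            assert knapsack_value(weights, values, capacity) == \
                brute_force_value(weights, values, capacity)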


I’ve been saying this for years now: you can’t avoid communicating what you want a computer to do. The specific requirements have to be stated somewhere.

Inferring intent from plain English prompts and context is a powerful way for computers to guess what you want from underspecified requirements, but the problem of defining what you want specifically always requires you to convey some irreducible amount of information. Whether it’s code, highly specific plain English, or detailed tests, if you care about correctness they all basically converge to the same thing and the same amount of work.


> if you care about correctness they all basically converge to the same thing and the same amount of work.

That's the part I'd push back on. They're not the same amount of work.

When I'm writing the code myself, it's basically a ton of "plumbing" of loops and ifs and keeping track of counters and making sure I'm not making off-by-one errors and not making punctuation mistakes and all the rest. It actually takes quite a lot of brain energy and time to get that all perfect.

It saves a lot of time to write the function definition in plain English, have the LLM generate a bunch of tests that you verify match that definition... and then let the LLM take care of all the loops and indexing and punctuation and plumbing.

I regularly cut what used to be an entire afternoon or day's worth of work down into 30 minutes. I spend 10 minutes writing the design for what will be 500-1,000 lines of code, 5 minutes answering the LLM's questions about it, 5 minutes skimming the code to make sure it all looks vaguely plausible (no obvious red flags), 5 minutes ensuring the unit tests cover everything I can think of (almost always, the LLM has thought of a bunch of edge cases I never would have bothered to test), and another 5 minutes telling it to fix things, like its unit tests make me suddenly realize there's an edge case that should be defined differently.

The idea that it's the "same amount of work" is crazy to me. It's so much more efficient. And in all honesty, the code is more reliable too because it tests things that I usually wouldn't bother with, because writing all the tests is so boring.


> When I'm writing the code myself, it's basically a ton of "plumbing" of loops and ifs and keeping track of counters and making sure I'm not making off-by-one errors and not making punctuation mistakes and all the rest. It actually takes quite a lot of brain energy and time to get that all perfect.

All of that "plumbing" affects behavior. My argument is that all of the brain energy used when checking that behavior is necessary in order to check that behavior. Do you have a test for an off-by-one error? Do you have a test to make sure your counter behaves correctly when there are multiple components on the same page? Do you have a test to make sure errors don't cause the component to crash? Do you have a test to ensure non-UTF-8 text or binary data in a text input throws a validation error? Etc etc. If you're checking all the details for correct behavior, the effort involved converges to roughly the same thing.

If you're not checking all of that plumbing, you don't know whether or not the behavior is correct. And the level of abstraction used when working with agents and LLMs is not the same as when working with a higher level language, because LLMs make no guarantees about the correspondence between input and output. Compilers and programming languages are meticulously designed to ensure that output is exactly what is specified. There are bugs and edge cases in compilers and quirks based on different hardware, so it's not always 100% perfect, but it's 99.9999% perfect.

When you use an LLM, you have no guarantees about what it's doing, and in a way that's categorically different than not knowing what a compiler does. Very few people know all of the steps that break down `console.log("hello world")` into the electrical signals that get sent to the pixels on a screen on a modern OS using modern hardware given the complexity of the stack, but they do know with as close as is humanly possible to 100% certainty that a correctly configured environment will result in that statement outputting the text "hello world" to a console. They do not need to know the implementation because the contract is deterministic and well defined. Prompts are not deterministic nor well defined, so if you want to verify it's doing what you want it to do, you have to check what it's doing in detail.

Your basic argument here is that you can save a lot of time by trusting the LLM will faithfully wire the code as you want, and that you can write tests to sanity check behavior and verify that. That's a valid argument, if you're ok tolerating a certain level of uncertainty about behavior that you haven't meticulously checked or tested. The more you want to meticulously check behavior, the more effort it takes, and the more it converges to the effort involved in just writing the code normally.


> If you're checking all the details for correct behavior, the effort involved converges to roughly the same thing.

Except it doesn't. It's much less work to verify the tests.

> That's a valid argument, if you're ok tolerating a certain level of uncertainty about behavior that you haven't meticulously checked or tested.

I'm a realist, and know that I, like all other programmers, am fallible. Nobody writes perfect code. So yes, I'm ok tolerating a certain level of uncertainty about everybody's code, because there's no other choice.

I can get the same level of uncertainty in far less time with an LLM. That's what makes it great.


> Except it doesn't. It's much less work to verify the tests.

This is only true when there is less information in those tests. You can argue that the extra information you see in the implementation doesn't matter as long as it does what the tests say, but the amount of uncertainty depends on the amount of information omitted in the tests. There's a threshold over which the effort of avoiding uncertainty becomes the same as the effort involved in just writing the code. Whether or not you think that's important depends on the problem you're working on and your tolerance for error and uncertainty, and there's no hard and fast rule for that. But if you want to approach 100% correctness, you need to attempt to specify your intentions 100% precisely. The fact that humans make mistakes and miscommunicate their intentions does not change the basic fact that a human needs to communicate their intention for a machine to fulfill that intention. The more precise the communication, the more work that's involved, regardless of whether you're verifying that precision after something generates it or generating it yourself.

> I can get the same level of uncertainty in far less time with an LLM. That's what makes it great.

I have a low tolerance for uncertainty in software, so I usually can't reach a level I find acceptable with an LLM. Fallible people who understand the intentions and current function of a codebase have a capacity that a statistical amalgamation of tokens trained on fallible people's output simply do not have. People may not use their capacity to verify alignment between intention and execution well, but they have it.

Again, I'm not denying that there's plenty of problems where the level of uncertainty involved in AI generated code is acceptable. I just think it's fundamentally true that extra precision requires extra work/there's simply no way to avoid that.


> I have a low tolerance for uncertainty in software

I think that's what's leading you to the unusual position that "This is only true when there is less information in those tests."

I don't believe in perfection. It's rarely achieved despite one's best efforts -- it's a mirage. What we can realistically look for is a statistical level of reliability that tests help achieve.

At the end of the day, it's about delivering value. If you can on average deliver 5x value with an LLM because of the speed, or 1.05x value because you verified every line of code 3 times and avoided a rare bug that neither you nor the LLM thought to test (compared to the 1x value of a non-perfectionist developer), then I know which one I'm choosing.


Most unit tests aren't that complicated.

You take the smallest value and biggest value, do they work?

Take something in the middle, does that work?

Get the smallest and make it even smaller, does it break?

Get the biggest value, make it bigger, does it break?

GOTO 10

And once you've got the pattern down, checking the rest is mostly just copying and pasting with different values for "smallest" and "biggest".

Something an LLM is very very good at.
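
In pytest terms the whole pattern is a couple of parametrize blocks (a sketch; clamp_quantity() and its 1-100 valid range are made up):

    # Boundary-value pattern from above: min, max, middle, just below min, just above max.
    import pytest

    from orders import clamp_quantity  # hypothetical function with valid range [1, 100]

    @pytest.mark.parametrize("qty", [1, 100, 50])  # smallest, biggest, something in the middle
    def test_valid_quantities_pass_through(qty):
        assert clamp_quantity(qty) == qty

    @pytest.mark.parametrize("qty", [0, 101])  # just below min, just above max
    def test_out_of_range_raises(qty):
        with pytest.raises(ValueError):
            clamp_quantity(qty)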

Also you should always use another LLM to critique your primary one (or the same LLM with a cleared context). I've found that gpt-5-high is VERY good at finding subtle bugs Claude will never catch. It can fix them immediately when I give it the Codex commentary though.


> Unit testing. LLMs are very good at writing tests and writing code that is testable (as long as you ask it)

The unit tests LLMs generate are also often crap, testing tautologies, making sure that your dependencies act as specified without testing the actual code, etc.


Not in my experience.

Maybe if your instructions are super unclear?


I’ve been criticized for this by my coworkers in the past, but I strongly believe that this is generally true and has been for quite a while. Developers, myself included, like to think their code is special, set in stone, and going to last forever. Most of the code we write struggles to live a few years, yet we treat all of it like it’s going to last forever. I’ve been an advocate for flipping that: treating code like it will not last long, and, when we identify the components that will, going back and optimizing them.

I’m pretty confident that most developers, again including myself, just really enjoy knowing something is done well. Being able to separate yourself from the code and fixate solely on the outcomes can sometimes get me past this.


I think this is true for the edges, but if you build on top of software that's not done well, it's a bad time.


Sadly I find most software I am building on top of is pretty awful... but I'm working in the real estate world right now, so that is unavoidable.


Your first response sounded like:

You've got more diabetes? Use more insulin :x (insulin is very good at handling diabetes) (analogy).

Seniors will tell you: the more senior you get, the more code you delete. So I don't think more cushion for a higher jump is the solution; sometimes you don't need to jump from that high.

We're moving to Junior Generative Juniors, recursively.


They're OK at it. I usually get more thoroughness of scenarios than a mediocre human engineer (which is great!) but less thoroughness of validation and output checking than a good human engineer (which is less so).

But if you have a lot of unit tests and need to make a cross-cutting refactor you run into the same problem that you always have if all your coverage is at the unit level. Now your unit boundary is fundamentally different and you need to know how to lift and shift all the relevant tests to the relevant new places.

And so far I've been less impressed by the "agents"' attempts at cross-cutting integration testing since this usually requires selective and clever interface setup and refactoring.

LLMs have a habit of creating one-off setups for particular unit test scenarios, which doesn't scale well to that problem.


Everyone keeps talking about unit testing as the answer to this problem. But we need to remember, as Dijkstra famously explained, that tests cannot prove the absence of bugs. Tests can only prove that bugs exist.


Doesn't work if it's a black box. You still have to inspect the code performing the operation.


The same way that we dealt with Excel programming. Ignore it until it blows up, then spend hundreds of thousands trying to fix it before the company goes bankrupt.


"But we have exactly the same number of reviewers."

LLMs can help with reviews as well. LLMs are not too bad at reviewing code; GPT-5, for example, can find off-by-one errors, missed returns, and all sorts of localized problems. I think they have a harder time with issues requiring a higher-level, global understanding. I wonder if in the future you could fine-tune an LLM on a big codebase (maybe nightly or something) and it could be the first-level reviewer for all changes to that codebase.


OR problems are hard because whoever tries to vibe-code them probably doesn't realize they map onto a specific class of algorithms and could prompt the LLM to use it; what's worse is that even if you tell them so, they won't be able to understand the math behind it and would much prefer their vibe-coded solution.


<< and make a technical presentation to their leadership

Honestly, this may be the only way to go about it.


The problem can almost certainly be solved to provable optimality using HiGHS or even CBC - the open source Python package PuLP comes with CBC.

If you want to be seen as the hero who solves things instead of the realist who says why other solutions won’t work, this could be worth exploring.
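
For concreteness, a minimal PuLP sketch of what I mean (the data, objective, and constraint are toy placeholders, not the actual problem above):

    # Tiny 0/1 integer program solved with the CBC solver that ships with PuLP.
    import pulp

    values = [10, 7, 3, 9]
    weights = [4, 3, 1, 5]
    capacity = 8

    prob = pulp.LpProblem("toy_knapsack", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(values))]

    prob += pulp.lpSum(v * xi for v, xi in zip(values, x))               # objective
    prob += pulp.lpSum(w * xi for w, xi in zip(weights, x)) <= capacity  # constraint

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print(pulp.LpStatus[prob.status], [int(xi.value()) for xi in x])

The point is that the solver returns a provably optimal solution rather than the heuristic guesses a naive search gives you.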


Of course it can be solved using the proper tools by a domain expert/practitioner (even with ChatGPT, since the expert will know what to ask).

But why didn't the AI expert solve it using ChatGPT? If it has to land with an expert for a reimplementation from scratch after a day wasted reviewing slop, did we gain any productivity?


Gain in productivity, maybe nothing. I didn’t mean to comment on the broader scope, just offer an actionable suggestion for GP.


It's basically a scheme that lets less scrupulous people offload work onto others while making a good impression on managers, PMs, and CEOs who want to automate our jobs and fire us.


PMs, POs, and CEOs should have been the first jobs to automate in our industry. Nothing of what the vast majority of them do is based on evidence and verifiable facts; it's already all vibes anyway.



