Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I see a lot of threads pitting models against each other (or whole swarms of them) in the hope that "wisdom of crowds" will magically appear. After a stack of experiments of my own—and after watching the recent ASU/Microsoft-Research work [1].. I've landed on a simpler takeaway:

An LLM is a terrible verifier of another LLM. Subbarao Kambhampati's "(How) Do LLMs Reason/Plan?" talk shows GPT-4 confidently producing provably wrong graph-coloring proofs until a symbolic SAT solver is introduced as the referee [1]. Stechly et al. quantify the problem: letting GPT-4 critique its own answers *reduces* accuracy, whereas adding an external, sound verifier boosts it by ~30 pp across planning and puzzle tasks [2]. In other words, verification is *harder* than generation for today's autoregressive models, so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Because of that asymmetry, stacking multiple LLMs rarely helps. The "LLM-Modulo" position paper argues that auto-regressive models simply can't do self-verification or long-horizon planning on their own and should instead be treated as high-recall idea generators wrapped by a single, sound verifier [3]. In my tests, replacing a five-model "debate" with one strong model + verifier gives equal or better answers with far less latency and orchestration overhead.

[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)

[2] https://arxiv.org/abs/2402.08115

[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)



Your references show me that it is absolutely task depended. In many domains it's true that "criticizing is easier than creating".

The best example might be books and movies, where it's trivial to say the characters were shallow, but it's surprisingly hard to create deeply interesting characters.

In Software Engineering, there are similar dynamics. An LLM with a security vuln finding prompt will be able to point out places, where the generated code might be insecure.

But if you want another LLM to find a reasoning mistake in a mathematical proof it basically has to do all the reasoning work as well. In which case I doubt there will be any significant performance gains.


In principle, Math proofs are another relatively easy to verify problem. In the extreme case, you can express any math proof as a computer-verifiable formalism — no intelligence necessary. Step back one step, and you could have a relatively weak model translate a proof into verifiable formalism and then use a tool call to run the verification. Coming up with the proof is an expensive search process, while verifying it is more mechanical. Even if it is not completely trivial to make the proof computer-verifiable, it might still be a vastly easier task compared to finding the proof in the first place.


An LLM cannot reason through a mathematical proof, it would be something other than an LLM if it could.


LLM is a overloaded term now as ML models can do tool calls, or MoE segmentation can have specialized solvers embedded... but people will call all variations LLMs.


For better or worse this has become the defacto standard in LLM Evaluation research papers since the LLM as a Judge paper [0] came out. Its also heavily embedded into frameworks like LangChain and LlamaIndex to evaluate RAG pipelines.

[0] https://arxiv.org/abs/2306.05685

[1] https://arxiv.org/abs/2411.15594


its for the better, and i'm actually serious about this. it's just that Subbarao is ALSO right and it is not perfect nor human level. but it -DOES- improve results measurably and consistently.

so what i'm saying is don't throw the baby out with the bathwater. LLM as judge doesnt replace human judgement but its a pretty darn good first pass for how cheap it is. and you can imagine that it will get better over time.


> ...so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Agree. What do you think about telling the LLM to also generate unit tests for the code it spits and then run all tests (including previous application unit tests).

I think this is a way to ensure some level of grounded verification:

- Does code compile?

- Do unit test pass?

AI can then consume test results to help fix their own mistakes.


This works well but only if you eyeball the tests and edit them a bit in my experience. Otherwise it gets lazy and makes them trivial to pass. Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.


> Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.

You can use property based testing for that.

But I've often run into cases where the AI gets into a vicious spiral of worse and worse code when you keep feeding it the test failures.


Yet, having the LLM evaluate the tests often works well. Especially if you ask it to look for hardcoded test cases.


Definitely, test runners are a way to ground the model and give it a feedback loop. Not a silver bullet but can be very helpful.


I believe, what the smart AI company is trying to do, right now, in secret, is to use US, the humans, and our replies to the AIs, as training for the next generation of self-verifying-models. :)

Training on corpus data gets you to 1 order of magnitude. But training on interactive data where you can observe and adapt to the OODA-loop? So much more powerful.

At least, that's what I'd be doing if I were doing AI :)

But I just do BrowserBox


I think you'd need to screen for quality of response quite stringently as loads of people will produce "corrections" which are just plain wrong.


Good point! But you could probably identify "super users" who are the ones whose responses you want to mine hahaha :)


I assume everyone knows this, but the idea of generating answers and testing them, dates back decades, and has been widely used for problems where generating _the_ correct answer(s) is difficult, but where generating a bunch of potential answers--(at least) one of which is likely correct--is easier. Generate-and-test of course relies on having a test algorithm that is reliable, (relatively) fast, and memory efficient, and is most useful when an exact generate algorithm (one that generated only the correct answer(s)) is either slow or inefficient of memory use (or both).

In the case described, the generator is an LLM, and the tester (called a "verifier") is "the compiler, linter, SAT solver, ground truth dataset, etc."

And of course generate-and-test is related to trial-and-error, which has probably existed since the Paleolithic.


"letting GPT-4 critique its own answers reduces accuracy"

This is because the output, being the input, steers directly into the tree as soon as the tree is in the context window.


Would a LLM under human guidance turn out to be a good verifier ? i.e. if LLM knows the rules to verify or has enough data points (internet access, actual responses)


Of course, that only works for problems where you have a verifier.


actually, I found that you can definitely yield better results. I ran an experiment with 1 prompt at temperature 0 and 9 with temperature 1.

I found the most anomalous response was as good (15/20) or better (5/20) than the temperature 0 response in 20 samples.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: