I see a lot of threads pitting models against each other (or whole swarms of the...

zurfer · 2025-04-30T07:09:43 1745996983

Your references show me that it is absolutely task depended. In many domains it's true that "criticizing is easier than creating".

The best example might be books and movies, where it's trivial to say the characters were shallow, but it's surprisingly hard to create deeply interesting characters.

In Software Engineering, there are similar dynamics. An LLM with a security vuln finding prompt will be able to point out places, where the generated code might be insecure.

But if you want another LLM to find a reasoning mistake in a mathematical proof it basically has to do all the reasoning work as well. In which case I doubt there will be any significant performance gains.

aoeusnth1 · 2025-04-30T15:57:43 1746028663

In principle, Math proofs are another relatively easy to verify problem. In the extreme case, you can express any math proof as a computer-verifiable formalism — no intelligence necessary. Step back one step, and you could have a relatively weak model translate a proof into verifiable formalism and then use a tool call to run the verification. Coming up with the proof is an expensive search process, while verifying it is more mechanical. Even if it is not completely trivial to make the proof computer-verifiable, it might still be a vastly easier task compared to finding the proof in the first place.

simulator5g · 2025-04-30T20:21:35 1746044495

An LLM cannot reason through a mathematical proof, it would be something other than an LLM if it could.

mycall · 2025-04-30T21:44:19 1746049459

LLM is a overloaded term now as ML models can do tool calls, or MoE segmentation can have specialized solvers embedded... but people will call all variations LLMs.

meander_water · 2025-04-30T04:15:46 1745986546

For better or worse this has become the defacto standard in LLM Evaluation research papers since the LLM as a Judge paper [0] came out. Its also heavily embedded into frameworks like LangChain and LlamaIndex to evaluate RAG pipelines.

[0] https://arxiv.org/abs/2306.05685

[1] https://arxiv.org/abs/2411.15594

swyx · 2025-04-30T05:52:49 1745992369

its for the better, and i'm actually serious about this. it's just that Subbarao is ALSO right and it is not perfect nor human level. but it -DOES- improve results measurably and consistently.

so what i'm saying is don't throw the baby out with the bathwater. LLM as judge doesnt replace human judgement but its a pretty darn good first pass for how cheap it is. and you can imagine that it will get better over time.

hu3 · 2025-04-30T03:04:21 1745982261

> ...so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Agree. What do you think about telling the LLM to also generate unit tests for the code it spits and then run all tests (including previous application unit tests).

I think this is a way to ensure some level of grounded verification:

- Does code compile?

- Do unit test pass?

AI can then consume test results to help fix their own mistakes.

nojs · 2025-04-30T04:04:48 1745985888

This works well but only if you eyeball the tests and edit them a bit in my experience. Otherwise it gets lazy and makes them trivial to pass. Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.

eru · 2025-04-30T18:10:28 1746036628

> Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.

You can use property based testing for that.

But I've often run into cases where the AI gets into a vicious spiral of worse and worse code when you keep feeding it the test failures.

macrolime · 2025-04-30T10:46:08 1746009968

Yet, having the LLM evaluate the tests often works well. Especially if you ask it to look for hardcoded test cases.

dwaltrip · 2025-04-30T03:34:04 1745984044

Definitely, test runners are a way to ground the model and give it a feedback loop. Not a silver bullet but can be very helpful.

keepamovin · 2025-04-30T07:07:51 1745996871

I believe, what the smart AI company is trying to do, right now, in secret, is to use US, the humans, and our replies to the AIs, as training for the next generation of self-verifying-models. :)

Training on corpus data gets you to 1 order of magnitude. But training on interactive data where you can observe and adapt to the OODA-loop? So much more powerful.

At least, that's what I'd be doing if I were doing AI :)

But I just do BrowserBox

captainbland · 2025-04-30T11:11:42 1746011502

I think you'd need to screen for quality of response quite stringently as loads of people will produce "corrections" which are just plain wrong.

keepamovin · 2025-05-01T03:24:41 1746069881

Good point! But you could probably identify "super users" who are the ones whose responses you want to mine hahaha :)

mcswell · 2025-05-01T03:37:17 1746070637

I assume everyone knows this, but the idea of generating answers and testing them, dates back decades, and has been widely used for problems where generating _the_ correct answer(s) is difficult, but where generating a bunch of potential answers--(at least) one of which is likely correct--is easier. Generate-and-test of course relies on having a test algorithm that is reliable, (relatively) fast, and memory efficient, and is most useful when an exact generate algorithm (one that generated only the correct answer(s)) is either slow or inefficient of memory use (or both).

In the case described, the generator is an LLM, and the tester (called a "verifier") is "the compiler, linter, SAT solver, ground truth dataset, etc."

And of course generate-and-test is related to trial-and-error, which has probably existed since the Paleolithic.

foobiekr · 2025-04-30T17:08:52 1746032932

"letting GPT-4 critique its own answers reduces accuracy"

This is because the output, being the input, steers directly into the tree as soon as the tree is in the context window.

ashu1461 · 2025-04-30T03:07:24 1745982444

Would a LLM under human guidance turn out to be a good verifier ? i.e. if LLM knows the rules to verify or has enough data points (internet access, actual responses)

eru · 2025-04-30T18:08:47 1746036527

Of course, that only works for problems where you have a verifier.

autokad · 2025-04-30T15:55:14 1746028514

actually, I found that you can definitely yield better results. I ran an experiment with 1 prompt at temperature 0 and 9 with temperature 1.

I found the most anomalous response was as good (15/20) or better (5/20) than the temperature 0 response in 20 samples.