We don't, but the point is that it's only one part of the entire system. If you have a (human-supplied) scoring function, then even completely random mutations can serve as a mechanism to optimize: you generate a bunch, keep the better ones according to the scoring function and repeat. That would be a very basic genetic algorithm.
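That loop can be sketched in a few lines of Python. The scoring function here (steer a list of numbers toward a target sum) is just a made-up stand-in for whatever human-supplied metric you care about:

```python
import random

# Made-up scoring function: higher is better. Here we "optimize"
# a list of numbers so that its sum lands close to 100.
def score(candidate):
    return -abs(sum(candidate) - 100)

def mutate(candidate):
    # Completely random mutation: tweak one position by a random amount.
    i = random.randrange(len(candidate))
    new = list(candidate)
    new[i] += random.uniform(-5, 5)
    return new

random.seed(0)
population = [[random.uniform(0, 10) for _ in range(10)] for _ in range(20)]

for generation in range(200):
    # Generate a bunch of mutated offspring...
    offspring = [mutate(random.choice(population)) for _ in range(40)]
    # ...keep the better ones according to the scoring function, and repeat.
    population = sorted(population + offspring, key=score, reverse=True)[:20]

best = population[0]
print(score(best))  # close to 0, i.e. sum(best) is close to 100
```

Even with fully random mutations, selection pressure alone drives the score up; the point above is that an LLM can replace `mutate` with something smarter.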
The LLM serves to guide the search more "intelligently" so that mutations aren't actually random but can instead draw from what the LLM "knows".
In this case AlphaEvolve doesn't write proofs; it uses the LLM to write Python code (or any language, really) that produces some numerical inputs to a problem.
They just try out the inputs on the problem they care about. If the code gives better results, they keep it around. They actually keep a few of the previous versions that worked well as inspiration for the LLM.
If the LLM is hallucinating nonsense, it will just produce broken code that gives horrible results, and that idea will be thrown away.
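As a rough sketch of that filter step (the `solve()` convention and the scoring function are made up for illustration, not AlphaEvolve's actual interface): each candidate is just a piece of source code, and broken code scores as badly as possible.

```python
# Each candidate is Python source that should define a function `solve()`.
def evaluate(source):
    """Run candidate code; hallucinated nonsense just scores -inf."""
    namespace = {}
    try:
        exec(source, namespace)        # may raise on broken code
        answer = namespace["solve"]()  # may raise, or return garbage
        return score(answer)           # deterministic, domain-specific check
    except Exception:
        return float("-inf")           # broken idea: thrown away

# Made-up scoring for illustration: how close is the answer to 42?
def score(answer):
    return -abs(answer - 42)

candidates = [
    "def solve(): return 40 + 2",         # works, scores well
    "def solve(): return undefined_var",  # hallucination: NameError
    "this is not even Python (",          # hallucination: SyntaxError
]

# Keep the best-scoring version(s) around as inspiration for the LLM.
survivors = sorted(candidates, key=evaluate, reverse=True)[:1]
print(survivors)
```

Nothing in the evaluator needs to "detect" hallucination; garbage code simply fails to run or produces a bad score, and that branch dies on its own.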
The final evaluation is performed with a deterministic tool that's specialized for the current domain. It doesn't care that it's getting its input from an LLM that may be hallucinating.
The catch, however, is that this approach can only be applied to areas where you can have such an automated verification tool.
Google's system is like any other optimizer, where you have a scoring function, and you keep altering the function's inputs to make the scoring function return a big number.
The difference here is the function's inputs are code instead of numbers, which makes LLMs useful because LLMs are good at altering code. So the LLM will try different candidate solutions, then Google's system will keep working on the good ones and throw away the bad ones (colloquially, those branches are "cut").
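A toy version of that loop, where `llm_mutate` is a stand-in that blindly tweaks a constant in the candidate program (a real system would prompt the model to rewrite the code, and the target of 7 is made up):

```python
import random

# Stand-in for "LLM edits the code": nudge the program's constant up or down.
def llm_mutate(source):
    const = int(source.split()[-1])
    return f"def solve(): return {const + random.choice([-1, 1])}"

def scoring_function(source):
    # Deterministic evaluator: run the candidate and reward answers
    # near the (made-up) target of 7. Bigger is better.
    ns = {}
    exec(source, ns)
    return -abs(ns["solve"]() - 7)

random.seed(1)
best = "def solve(): return 0"
for _ in range(100):
    candidate = llm_mutate(best)
    if scoring_function(candidate) > scoring_function(best):
        best = candidate  # keep working on the good ones
    # otherwise the bad branch is cut: the candidate is simply discarded

print(best)
```

Structurally this is just hill climbing; the only novelty is that the thing being perturbed is a program rather than a vector of numbers.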
Exactly, he even mentioned that it's a variant of a traditional optimization tool, so it's not surprising to see cutting-plane methods and, when the structure allows, Benders decomposition.
The LLM basically just produces some code that either runs and produces good results or it doesn't. If it produces garbage, that is the end of the line for that branch.
Can you explain this more? How on earth are we supposed to know the LLM is hallucinating?