AlphaFold has been widely validated; its predictions are now appreciated to be pretty damn good, with a few important exceptions, some of which are addressed in the newer implementation.
So... good what percentage of the time? If you built an AI to pilot an airplane, how would you verify its behavior at the edge conditions, you know, like plummeting out of the sky because it thought it had to nosedive?
And because these AIs are black-box neural networks, how do you know their predictions are correct for inputs that aren't in the training dataset?
As mentioned elsewhere in this thread, and as is trivially discoverable by reading up on it, AF2 is constantly evaluated in blind predictions where the known structure is withheld until after the prediction is submitted. There's no weaseling here; the process is well understood and accepted by the wider community.
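For anyone wondering what "compare against the withheld structure" means concretely: CASP scores models with metrics like GDT_TS, but the core idea is just measuring how far the predicted coordinates are from the experimental ones after optimally superposing the two. A minimal sketch in plain NumPy (my own illustration of the idea, not CASP's actual scoring code), using the classic Kabsch alignment plus RMSD:

    import numpy as np

    def kabsch_rmsd(pred, ref):
        # C-alpha RMSD between a predicted and an experimental structure,
        # both (N, 3) coordinate arrays, after optimal rigid superposition.
        p = pred - pred.mean(axis=0)   # center each set on its centroid
        q = ref - ref.mean(axis=0)
        # SVD of the covariance gives the optimal rotation R = V D U^T;
        # D = diag(1, 1, sign(det(V U^T))) guards against a reflection.
        u, _, vt = np.linalg.svd(p.T @ q)
        d = np.sign(np.linalg.det(vt.T @ u.T))
        rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
        diff = p @ rot.T - q
        return float(np.sqrt((diff ** 2).sum() / len(p)))

In a blind round, `pred` is submitted before the experimental `ref` is publicly released, so the model cannot have "seen the answer"; scoring low error on targets like that is what "pretty damn good" means upthread.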