I'm curious why you think the answer would be no. I've had some success with res...

ben_w · 2026-04-15T08:55:19

I've been using 5.4 recently, and even on "extra high", some of the tests it wrote just opened the source code and ran a regex to confirm the presence (or in some cases the absence) of specific substrings. They weren't running the code to confirm behaviour, and the regexes didn't even check that the matched text wasn't commented out (not that this would've been sufficient if they had; it's just to illustrate how bad it was).
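For anyone who hasn't run into this pattern, here is a rough sketch of the difference. Everything in it is made up for illustration (the `clamp` function, its bug, and both test helpers are hypothetical, not the actual code or tests from that session): the source-regex check passes because the expected expression appears in a comment, while a test that runs the code catches the bug.

```python
import re

# Hypothetical source under test: the correct expression is commented out,
# so the function silently returns its input unchanged.
SOURCE = '''
def clamp(x, lo, hi):
    # return max(lo, min(x, hi))
    return x
'''


def source_regex_test(source: str) -> bool:
    """Anti-pattern: passes if the substring appears anywhere, even in a comment."""
    return re.search(r"max\(lo, min\(x, hi\)\)", source) is not None


def behavioural_test(source: str) -> bool:
    """Executes the code and checks what the function actually returns."""
    namespace = {}
    exec(source, namespace)  # fine for a trusted, in-memory example like this
    clamp = namespace["clamp"]
    return (
        clamp(15, 0, 10) == 10
        and clamp(-5, 0, 10) == 0
        and clamp(5, 0, 10) == 5
    )


regex_passes = source_regex_test(SOURCE)
behaviour_passes = behavioural_test(SOURCE)
print("regex test passes:", regex_passes)
print("behavioural test passes:", behaviour_passes)
```

Stripping comments before matching would fix this particular false positive, but the regex would still only show that some text exists, not that the code does the right thing.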

So, yeah. I'd guesstimate this model was fine 75% of the time, mediocre 15-20% of the time, and actively bad 5-10% of the time. How valuable it is depends on how much energy you, as a human, can spare for spotting the bad output.