
SWE-bench Verified is probably benchmaxxed at this stage. Claude isn't even the top performer; that honor goes to Doubao [1].

Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be down to chance.
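
For anyone who wants to check that figure, here is a rough back-of-the-envelope sketch. It assumes the benchmark has 500 tasks and uses an illustrative pass rate of 80%. Neither number comes from the leaderboard itself, and the function name is made up for this example. It uses the plain normal-approximation (Wald) interval for a proportion, which is only a first-order estimate.

```python
import math

def wald_half_width(pass_rate, n_tasks, z=1.96):
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)

N_TASKS = 500     # assumption: number of tasks in the benchmark
PASS_RATE = 0.80  # assumption: illustrative score, not a reported result

half_width = wald_half_width(PASS_RATE, N_TASKS)
print(f"+/- {100 * half_width:.1f} percentage points")  # prints "+/- 3.5 percentage points"
```

Under those assumptions, two scores within roughly 3.5 points of each other are hard to tell apart from a single run. The interval is widest at a 50% pass rate and narrows toward 0% or 100%.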

[1] https://www.swebench.com/



Claude 4.5 gets 82% with their own highly customized scaffolding (parallel compute with a scoring function). That beats Doubao.



