
Pretty sure there is a subset of SWE-bench problems that are either ill-posed or not solvable with the intended setup; I think I remember seeing another company excluding a fraction of them for that reason. So maxing out SWE-bench might only mean ~95%.

I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend".



That's why they made SWE-bench Verified. Verified excludes those.



