Pretty sure there is a subset of SWE-bench problems that are either ill-posed or not solvable with the intended setup; I think I remember seeing another company excluding a fraction of them for that reason. So the effective ceiling on SWE-bench might only be ~95%, not 100%.
I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend"