
The reason they get a perfect score on AIME is that every AIME question has had a lot of thought put into it, and each one is verified to have a clear, attainable answer. SWE-bench, and many other AI benchmarks, have lots of eval noise: tasks where there is no clear right answer, so getting above a certain percentage means you are benchmaxxing.
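For intuition, here's a rough Python sketch (my own toy model, not anything taken from SWE-bench's actual harness): if some fraction of tasks are graded effectively at random, even a perfect model's score is capped below 100%, so sustained scores above that ceiling suggest fitting the noise rather than real capability. The noise rate and task count are made-up illustrative numbers.

    import random

    def simulate_score(true_skill: float, noise_rate: float, n_tasks: int = 500) -> float:
        """Fraction of tasks marked 'passed' when noise_rate of tasks are
        graded essentially at random (ambiguous spec, broken test, etc.)."""
        passed = 0
        for _ in range(n_tasks):
            if random.random() < noise_rate:
                # Noisy task: outcome is a coin flip regardless of model quality.
                passed += random.random() < 0.5
            else:
                # Clean task: outcome tracks the model's real ability.
                passed += random.random() < true_skill
        return passed / n_tasks

    random.seed(0)
    for skill in (0.6, 0.9, 1.0):
        print(f"true skill {skill:.1f} -> scored {simulate_score(skill, noise_rate=0.15):.2f}")
    # Even a perfect model tops out around 1 - noise_rate/2 in this toy setup,
    # so scores well above that ceiling are a hint the model is fitting the noise.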


> SWE-bench, and many other AI benchmarks, have lots of eval noise

SWE-bench has plenty of known limitations, even with the work that has gone into reducing solution leakage and overfitting.

> where there is no clear right answer

This is both a feature and a bug. If there is no clear answer, then how do you determine whether an LLM has progressed? It can't simply be judged on producing "more right answers" with each release.


Do you think a messier math benchmark (in terms of how it is defined) might be harder for these models to score well on?



