
The reason they get a perfect score on AIME is that every AIME question has had a lot of thought put into it, and each one is verified to have a clear, attainable answer. SWE-bench, and many other AI benchmarks, have lots of eval noise: tasks where there is no clear right answer, so getting above a certain percentage means you are benchmaxxing.
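For intuition, here's a rough Python sketch (my own toy model, not anything taken from SWE-bench's actual harness): if some fraction of tasks are graded effectively at random, even a perfect model's score is capped below 100%, so sustained scores above that ceiling suggest fitting the noise rather than real capability. The noise rate and task count are made-up illustrative numbers.

    import random

    def simulate_score(true_skill: float, noise_rate: float, n_tasks: int = 500) -> float:
        """Fraction of tasks marked 'passed' when noise_rate of tasks are
        graded essentially at random (ambiguous spec, broken test, etc.)."""
        passed = 0
        for _ in range(n_tasks):
            if random.random() < noise_rate:
                # Noisy task: outcome is a coin flip regardless of model quality.
                passed += random.random() < 0.5
            else:
                # Clean task: outcome tracks the model's real ability.
                passed += random.random() < true_skill
        return passed / n_tasks

    random.seed(0)
    for skill in (0.6, 0.9, 1.0):
        print(f"true skill {skill:.1f} -> scored {simulate_score(skill, noise_rate=0.15):.2f}")
    # Even a perfect model tops out around 1 - noise_rate/2 in this toy setup,
    # so scores well above that ceiling are a hint the model is fitting the noise.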


> SWE-bench, and many other AI benchmarks, have lots of eval noise

SWE-bench has plenty of known limitations, even with the work that has gone into reducing solution leakage and overfitting.

> where there is no clear right answer

This is both a feature and a bug. If there is no clear answer, then how do you determine whether an LLM has progressed? It can't simply be judged on producing "more right answers" with each release.


Do you think a messier math benchmark (in terms of how it is defined) might be harder for these models to score well on?



