The reason they get a perfect score on AIME is that every AIME question had a lot of thought put into it, and the organizers made sure every problem was actually solvable. SWE-bench, and many other AI benchmarks, have a lot of eval noise, where there is no clear right answer, and getting above a certain percentage means you are benchmaxxing.
> SWE-bench, and many other AI benchmarks, have lots of eval noise
SWE-bench has lots of known limitations, even with the design choices meant to reduce solution leakage and overfitting.
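To make the "no clear right answer" point concrete: SWE-bench-style grading reduces a patch to whether the repo's tests flip from failing to passing, so any patch that satisfies the tests counts as resolved, intended fix or not. A minimal sketch of that grading logic (function and field names are mine, not SWE-bench's actual harness):

```python
def instance_resolved(fail_to_pass: dict[str, bool],
                      pass_to_pass: dict[str, bool]) -> bool:
    """Grade one benchmark instance after applying a model's patch.

    fail_to_pass: tests that failed before the patch; all must now pass.
    pass_to_pass: tests that passed before; all must still pass.

    Note what this can't see: a patch that games the tests, or an
    equally valid fix the test suite happens to reject, gets scored
    exactly the same as the "right" answer.
    """
    return all(fail_to_pass.values()) and all(pass_to_pass.values())
```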
> where there is no clear right answer
This is both a feature and a bug. If there is no clear answer, then how do you determine whether an LLM has progressed? It can't simply be judged on producing "more right answers" with each release.
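One partial answer is to treat the score as a noisy estimate and ask whether a release-over-release gain actually clears the noise floor. A back-of-the-envelope sketch (all numbers hypothetical):

```python
import math

def wald_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation confidence interval on a pass rate."""
    p = passes / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

# Hypothetical numbers: a 500-task benchmark, two model releases.
lo, hi = wald_interval(330, 500)
print(f"old release: {lo:.1%} to {hi:.1%}")  # ~61.8% to 70.2%
lo, hi = wald_interval(345, 500)
print(f"new release: {lo:.1%} to {hi:.1%}")  # ~65.0% to 73.0%
# The intervals overlap, so this 3-point "gain" is within sampling
# noise alone, before even counting flaky tests or ambiguous tasks.
```

And that's only the statistical floor; eval noise from underspecified tasks sits on top of it.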