Hacker News

How is the score on AIME2024 relevant if AIME2024 has been used to train the model?


That is pretty much a universal problem. If you look at the problems anyone's models have solved, they are all well represented in the corpus.

Remember that AIME is intended for high schoolers to solve in 3 hours with just pencils, erasers, rulers, and compasses. There is an entire industry providing supplementary material to prepare students for concepts that are not directly covered in typical school material.

Since various blogs and practice tests often pull from previous years, those problems make it into all the common sources like Stack Overflow/Stack Exchange, Reddit, etc. Explicitly stating that they trained on AIME problems prior to 2024 isn't much different.

Basically, expect any model to have trained on all AIME problems available before its knowledge cutoff date.

To me, the answer to "How is the score on AIME2024 relevant" is that it is still not that high (from a practical standpoint) despite directly training on it.

That, combined with all the models' success falling dramatically on AIME2025, demonstrates the above. It also hints that Rao's claim may hold: what matters is compiling the verifier into training/scratch-space/prompt/fine-tuning etc. in a way the model can reliably access.


Google Gemini (2.5 Pro) made the same "mistake": their data cutoff is January 2025, and AIME 2024 was in February 2024.
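
To make the date argument concrete, here is a minimal sketch of that check, using only the month-level dates mentioned in this thread. The day-of-month values and the AIME 2025 date (assumed to be February 2025) are placeholders, and the helper name `possibly_in_training_data` is hypothetical. It only says whether a benchmark was public before a cutoff, not whether it was actually trained on.

```python
from datetime import date

# Dates at month granularity; the day (1st) is a placeholder assumption.
GEMINI_25_PRO_CUTOFF = date(2025, 1, 1)  # "January 2025", as stated in the thread
AIME_2024 = date(2024, 2, 1)             # "February 2024", as stated in the thread
AIME_2025 = date(2025, 2, 1)             # assumption: roughly one year after AIME 2024


def possibly_in_training_data(benchmark_date: date, cutoff: date) -> bool:
    """Return True if the benchmark was public on or before the data cutoff."""
    return benchmark_date <= cutoff


print(possibly_in_training_data(AIME_2024, GEMINI_25_PRO_CUTOFF))  # True
print(possibly_in_training_data(AIME_2025, GEMINI_25_PRO_CUTOFF))  # False
```

Under these assumed dates, AIME 2024 falls inside the cutoff window and AIME 2025 falls outside it, which is consistent with scores dropping on AIME2025.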



