That is pretty much a universal problem. If you look at the problems anyone's models have solved, they are all well represented in the corpus.
Remember that AIME is intended for high schoolers to solve in 3 hours with just pencils, erasers, rulers, and compasses. There is an entire industry providing supplementary material to prepare students for concepts that are not directly covered in typical school material.
Since various blogs and practice tests often pull from previous years' problems, and those make it into all the common sources like Stack Overflow/Stack Exchange, Reddit, etc., them explicitly stating that they trained on AIME problems prior to 2024 isn't much different.
Basically, expect any model to have trained on all AIME problems available before its knowledge cutoff date.
To me, the answer to "How is the score on AIME2024 relevant" is that it is still not that high (from a practical standpoint) despite directly training on it.
Combined with all the models' success falling dramatically on AIME2025, this demonstrates the above, and hints at Rao's claim that what matters is compiling the verifier into training/scratch-space/prompt/fine-tuning etc. in a way the model can reliably access.