Hacker News | BrunoDCDO's comments

I wonder if it would be possible to improve even further on the benchmark by simply showing Claude the current hardest problems and asking it to improve the prompt, without including any specifics related to those problems.
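Roughly the loop I have in mind (a hypothetical Python sketch using the Anthropic SDK; the model name, the failure-summary format, and the propose_improved_prompt helper are placeholders I made up, not part of any existing harness):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def propose_improved_prompt(current_prompt: str, failure_summaries: list[str]) -> str:
        """Ask Claude for a revised prompt, showing it only *descriptions* of
        where the current prompt falls short, never the benchmark tasks themselves."""
        meta_prompt = (
            "Here is the system prompt a smaller model currently uses:\n\n"
            f"{current_prompt}\n\n"
            "It still fails on tasks with these characteristics:\n"
            + "\n".join(f"- {s}" for s in failure_summaries)
            + "\n\nRewrite the prompt to address these weaknesses. "
            "Do not reference any specific task; keep the instructions general."
        )
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumption: any capable Claude model
            max_tokens=2000,
            messages=[{"role": "user", "content": meta_prompt}],
        )
        return response.content[0].text

The failure summaries would be written at the category level by hand, so nothing task-specific leaks into the new prompt.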


I think there's a chance we could squeeze out a better benchmark score, although there's a risk of overfitting, which I wanted to avoid.

The simplest test would be to make previously “unreachable” tasks succeed through obvious prompt tweaks — like reordering instructions or emphasizing key parts.

That said, my methodology intentionally avoided exposing the model to actual tasks. Instead, I focused on the domain as a whole: refining the instructions so a smaller model could understand and act reliably.
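To make the overfitting concern concrete, this is the kind of check I would run. It's only a sketch: run_task, the task objects with an id field, and the list of previously failing ids are hypothetical stand-ins for whatever harness is actually in use.

    from collections.abc import Callable, Iterable, Sequence

    def overfitting_check(
        candidate_prompt: str,
        tasks: Sequence,
        previously_failing_ids: Iterable[str],
        run_task: Callable[[str, object], bool],
    ) -> dict:
        """Score a candidate prompt on the whole suite, split by history.

        A tweak that lifts the previously failing tasks while the previously
        passing ones regress is a sign the prompt was overfit to those tasks.
        """
        failing = set(previously_failing_ids)
        hard = [t for t in tasks if t.id in failing]
        easy = [t for t in tasks if t.id not in failing]
        hard_rate = sum(run_task(candidate_prompt, t) for t in hard) / max(len(hard), 1)
        easy_rate = sum(run_task(candidate_prompt, t) for t in easy) / max(len(easy), 1)
        return {
            "previously_failing_pass_rate": hard_rate,
            "previously_passing_pass_rate": easy_rate,  # should stay close to 1.0
        }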


I think it's actually due to the fact that Claude isn't available in China, so they wouldn't be able to (legally) replicate how they evaluated the other LLMs (assuming they didn't just use the numbers reported by each model provider).

