Don't the models typically train on their input too? I.e. submitting the question also carries a risk/chance of it getting picked up?
I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?
Yeah, probably asking on LMArena makes this an invalid benchmark going forward, especially since I think Google is particular active in testing models on LMArena (as evidenced by the fact that I got their preview for this question).
I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.