Used an AI to populate some of 5.1 thinking's results. Benchmark | Gemini 3 Pro ...

scrollop · 2025-11-18T14:55:41 1763477741

Used an AI to populate some of 5.1 thinking's results.

Benchmark..................Description...................Gemini 3 Pro....GPT-5.1 (Thinking)....Notes

Humanity's Last Exam.......Academic reasoning.............37.5%..........52%....................GPT-5.1 shows 7% gain over GPT-5's 45%

ARC-AGI-2...................Visual abstraction.............31.1%..........28%....................GPT-5.1 multimodal improves grid reasoning

GPQA Diamond................PhD-tier Q&A...................91.9%..........61%....................GPT-5.1 strong in physics (72%)

AIME 2025....................Olympiad math..................95.0%..........48%....................GPT-5.1 solves 7/15 proofs correctly

MathArena Apex..............Competition math...............23.4%..........82%....................GPT-5.1 handles 90% advanced calculus

MMMU-Pro....................Multimodal reasoning...........81.0%..........76%....................GPT-5.1 excels visual math (85%)

ScreenSpot-Pro..............UI understanding...............72.7%..........55%....................Element detection 70%, navigation 40%

CharXiv Reasoning...........Chart analysis.................81.4%..........69.5%.................N/A

iamdelirium · 2025-11-18T17:18:08 1763486288

This is provably false. All it takes is a simple Google search and looking at the ARC AGI 2 leaderboard: https://arcprize.org/leaderboard

The 17.6% is for 5.1 Thinking High.

HardCodedBias · 2025-11-18T16:02:00 1763481720

What? The 4.5 and 5.1 columns aren't thinking in Google's report?

That's a scandal, IMO.

Given that Gemini-3 seems to do "fine" against the thinking versions why didn't they post those results? I get that PMs like to make a splash but that's shockingly dishonest.

iosjunkie · 2025-11-18T18:43:52 1763491432

It that true?

> For Claude Sonnet 4.5, and GPT-5.1 we default to reporting high reasoning results, but when reported results are not available we use best available reasoning results.

https://storage.googleapis.com/deepmind-media/gemini/gemini_...

mountainriver · 2025-11-18T17:50:58 1763488258

Every single time