Just curious, can you share which puzzles are so hard that even the top models can't crack them? Sometimes when I find a puzzle absolutely undecipherable, I like to ask LLMs to solve it, and I haven't seen them fail yet.
I actually tried using GPT-5.5 Pro on this problem recently. It thought it was making progress on one path, but it made so many mistakes that it didn't feel worth pushing further. It'll be interesting to check whether it's the same route. I got partial results (proved in Lean) that improve on the best-known results for four Erdős problems with GPT-5.5 Pro.
I built this benchmark this month: https://github.com/lechmazur/sycophancy. There are large differences between LLMs. For example, Mistral Large 3 and GPT-4.1 will initially agree with the narrator, while Gemini will disagree. I swap sides, so this is not about possible viewpoint bias in the LLMs. But another benchmark shows that Gemini will then change its view very easily in a multi-turn conversation while Kimi K2.5 or Grok won't: https://github.com/lechmazur/persuasion.
I built two related benchmarks this month: https://github.com/lechmazur/sycophancy and https://github.com/lechmazur/persuasion. There are large differences between LLMs. For example, good luck getting Grok to change its view, while Gemini 3.1 Pro will usually disagree with the narrator at first but then change its position very easily when pushed.