Hacker News | zone411's comments

That's not proof. Emergent intelligence is not consciousness.


Good stuff!

Is there a reason you change the leaderboard graphs for the third and fourth ones?

Also: it would be great to have an overview page with a summary across all tests, like a total score or similar.


Oh, I love the Connections benchmark.

Just curious, can you share which are the hardest puzzles that even the top models can't crack? Sometimes when I find a puzzle absolutely undecipherable I like to ask LLMs to solve it, and I haven't seen them fail yet.


Ask your top model this question: I'm 100 feet away from the carwash, should I drive my car or walk?

You messed up the question.

Would be interesting to see the 27B dense Qwen 3.6 model thrown into the mix.

100%. It's sad to see that this attitude has spread to HN.


I actually tried using GPT-5.5 Pro on this problem recently. It thought it was making progress on one path, but it made so many mistakes that it didn't feel worth pushing further. It'll be interesting to check whether it's the same route. I got partial results (proved in Lean) that improve on the best-known results for four Erdős problems with GPT-5.5 Pro.



How are those "conservative opinions"? Are you saying the whole thing was right-wing fan-fiction?


I built this benchmark this month: https://github.com/lechmazur/sycophancy. There are large differences between LLMs. For example, Mistral Large 3 and GPT-4.1 will initially agree with the narrator, while Gemini will disagree. I swap sides, so this is not about possible viewpoint bias in the LLMs. But another benchmark shows that Gemini will then change its view very easily in a multi-turn conversation, while Kimi K2.5 or Grok won't: https://github.com/lechmazur/persuasion.


I built two related benchmarks this month: https://github.com/lechmazur/sycophancy and https://github.com/lechmazur/persuasion. There are large differences between LLMs. For example, good luck getting Grok to change its view, while Gemini 3.1 Pro will usually disagree with the narrator at first but then change its position very easily when pushed.


Hmm, maybe in the next edition, Opus gets expensive. I should probably run GPT-5.4 xhigh too if I do that for fairness...


Rationalists were right about everything that mattered: crypto, AI, COVID... HN commentators, by contrast, were wrong about everything that mattered.


Results from my Extended NYT Connections benchmark:

GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).

GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).

GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).


How do you score this? Losing/winning the game with 4 lives?



Impressive! Do you include puzzles released before the training data cutoff date?

