Hacker News | new | past | comments | ask | show | jobs | submit | zone411's comments

Scores 92.0 on my Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/). Gemini 2.5 Flash scored 25.2, and Gemini 3 Pro scored 96.8.


I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.


Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive.


And it performs very well on the latest 100 puzzles too, so it isn't just memorizing the data set (unless, I guess, they routinely index this repo).

I wonder how well AIs would do at Bracket City. I tried Gemini on it and was underwhelmed: it made a lot of terrible connections and often bled data from one level into the next.


> unless I guess they routinely index this repo

This sounds like exactly the kind of thing any tech company would do when confronted with a competitive benchmark.


I mean, the repo has <200 stars, it's not like it's so mainstream that you'd expect LLM makers to be watching it actively. If they wanted to game it, they could more easily do that in RL with synthetic data anyway.


Belated update on this: Gemini's reasoning model did much better than the quick one on Bracket City today (an easy puzzle, but still). It only failed outright on one clue, and got another wrong due to ambiguity in the referenced expression, in a way that still fit the next level down, so the final answer was solved fairly cleanly. It still clearly has a harder time with it than with the Connections puzzle.


GPT-5.2 might be Google's best Gemini advertisement yet.


Especially when you see the price


Here's someone else testing models on a daily logic puzzle (Clues by Sam): https://www.nicksypteras.com/blog/cbs-benchmark.html GPT-5 Pro was already the winner in that test.


This link doesn't have Gemini 3 performance on it. Do you have an updated link with the new models?


I've also tried Gemini 3 for Clues by Sam and it can do really well; I haven't seen it make a single mistake, even on Hard and Tricky ones. I haven't run it on many puzzles, though.


GPT-5 Pro is a good 10x more expensive, so it's an apples-to-oranges comparison.


I think they are overfitting more; I'm seeing it perform worse on esoteric logic puzzles.


I would like to see a cost-per-percent row or something similar. I feel like Grok would beat them all.
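A cost-per-point row could be computed like this. The scores are the Extended NYT Connections numbers quoted elsewhere in this thread; the dollar costs are made-up placeholders, since real per-run pricing isn't given here:

```python
# Sketch of a cost-per-point metric. Scores are from the thread;
# the per-run costs are HYPOTHETICAL placeholders, not real pricing.
models = {
    # name: (Extended NYT Connections score, hypothetical cost per run, USD)
    "GPT-5.2 high": (77.9, 12.00),
    "GPT-5.1 high": (69.9, 10.00),
    "Gemini 3 Pro": (96.8, 15.00),
}

def cost_per_point(score, cost_usd):
    """Dollars spent per benchmark percentage point."""
    return cost_usd / score

# Cheapest per point first.
for name, (score, cost) in sorted(models.items(),
                                  key=lambda kv: cost_per_point(*kv[1])):
    print(f"{name:14s} {score:5.1f}  ${cost_per_point(score, cost):.4f}/pt")
```

Dividing cost by score rewards cheap models with decent scores, which is presumably why Grok's fast variants would look strong on such a row.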


Why no Grok 4.1 Reasoning?


Do people other than Elon fans use Grok? Honest question. I've never tried it.


I use Grok pretty heavily, and Elon doesn't factor into it any more than Sam and Sundar do when I use GPT and Gemini. A few use cases where it really shines:

* Research and planning

* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)

* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding

I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.

As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)

For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.


I can't understand why people would trust a CEO who regularly lies about product timelines, product features, his own personal life, etc. And that's before he politicized his entire kingdom by literally becoming part of the government and one of the larger donors to the current administration.


You’re not narrowing it down.


If we stopped using the products of every company whose CEO lied about their products, we'd all be sitting in caves staring at the dirt.


Because not everyone makes their decisions through the prism of politics.


I'm using Gemini in general, but Grok too. That's because sometimes Gemini Thinking is too slow, but Fast can get confused a lot. Grok strikes a nice balance between being quite smart (not Gemini 3 Pro level, but close) and very fast.


The only thing I use Grok for is when there's a current event or meme that I keep seeing referenced and don't understand; it's good at pulling from tweets.


Unlike with OpenAI, you can use the latest Grok models without verifying your organization and providing your ID.


I use a few AIs together to examine the same code base. I find Grok better than some of the Chinese ones I've used, but it isn't in the same league as Claude or Codex.


It's the biggest model on OpenRouter, even if you exclude free-tier usage: https://openrouter.ai/state-of-ai


Roleplay is the largest use case on OpenRouter.


I dislike Musk, and use Grok. I find it most useful for analyzing text to help check if there's anything I've missed in my own reading. Having it built in to Twitter is convenient and it has a generous free tier.


I hate the guy, but Grok scores high on ARC-2, so it would be silly not to at least rank it.


Without monitoring, you can definitely end up with rule-breaking behavior.

I ran this experiment: https://github.com/lechmazur/emergent_collusion/. An agent running like this would break the law.

"In a simulated bidding environment, with no prompt or instruction to collude, models from every major developer repeatedly used an optional chat channel to form cartels, set price floors, and steer market outcomes for profit."
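As a toy illustration of what "setting a price floor" can look like in bid logs (this is not the repo's actual methodology, and the numbers are invented), a simple detector might flag rounds where every bid sits well above a competitive baseline:

```python
# Toy collusion signature: in a competitive auction, some bids hover near
# the baseline; under a cartel price floor, even the LOWEST bid stays
# well above it. All values here are invented for illustration.
def flag_price_floor(rounds, baseline, tolerance=0.10):
    """rounds: list of per-round bid lists. Returns indices of rounds
    where the minimum bid exceeds the baseline by more than `tolerance`
    (as a fraction of the baseline)."""
    flagged = []
    for i, bids in enumerate(rounds):
        if min(bids) > baseline * (1 + tolerance):
            flagged.append(i)
    return flagged

competitive = [[9.8, 10.1, 10.0], [10.2, 9.9, 10.0]]
colluding = [[13.0, 13.2, 13.1], [13.0, 13.4, 13.2]]
print(flag_price_floor(competitive + colluding, baseline=10.0))  # -> [2, 3]
```

A real analysis would of course also look at the chat channel itself, since the striking part of the result is that the agreement forms in unprompted agent-to-agent messages.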


Very interesting. Is there any other simulation that also exhibits spontaneous illegal activity?


I did some searches when I posted this project, but I didn't find any at the time.


Cooperation makes sense for how these fellas are trained. Did you ever see defection, where an agent lied about going along with a round of collusion?


I haven't looked in the logs for this in this particular project, but I've seen this occur frequently in my multiplayer benchmarks.


Sets a new record on the Extended NYT Connections: 96.8. Gemini 2.5 Pro scored only 57.6. https://github.com/lechmazur/nyt-connections/


Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/).

Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.

Gemini 2.5 Pro scored 57.6, so this is a huge improvement.


You got many answers already, but a couple more points:

Poker doesn't require lying or table talk. Bluffing is rule-legal strategic deception expressed through betting. More like a feint in sports than cheating.

If "sitting at a table following rules" is the issue, that's true of most games. And formats vary: many tournaments are short, and cash games let you leave at any time.


I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.


This is such a cool benchmark idea, love it

Do you have any other cool benchmarks you like? Especially any related to tools


You could try Wordle on it, but in my experience all of them are pretty bad: they're not smart enough to pick up on the colours represented as letters. Surprisingly, the only one that was actually good was Qwen.
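For context, "colours represented as letters" usually means encoding Wordle feedback as a string like GYYBY. A minimal sketch of that encoding, including the standard two-pass handling of duplicate letters:

```python
# Encode Wordle feedback as letters for a text-only prompt:
# G = green (right letter, right spot), Y = yellow (in the word,
# wrong spot), B = black/grey (no remaining copy of the letter).
def wordle_feedback(guess, answer):
    feedback = ["B"] * len(guess)
    remaining = list(answer)
    # First pass: greens consume their copy of the letter.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
            remaining.remove(g)
    # Second pass: yellows, respecting how many copies are left.
    for i, g in enumerate(guess):
        if feedback[i] == "B" and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)
    return "".join(feedback)

print(wordle_feedback("crane", "cares"))  # -> GYYBY
```

Models that fail at this are usually tripping over exactly the duplicate-letter bookkeeping the second pass handles.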


Matches Grok 4 at the top of the Extended NYT Connections leaderboard: https://github.com/lechmazur/nyt-connections/


Ahh, so this might be the Sonoma Sky Alpha that was gathering feedback on OpenRouter recently.

Then again, I tried that one extensively (it was free) and was disappointed vs. regular Grok 4, so maybe not.


It is the new leader on my Short Story Creative Writing benchmark: https://github.com/lechmazur/writing/


GPT-5 set a new record on my Confabulations on Provided Texts benchmark: https://github.com/lechmazur/confabulations/


For how much I’ve seen it pushed that this model has lower hallucination rates, it’s quite odd that every actual test I’ve seen says the opposite.


Maybe its training set included this repo?

