The recent one was should I drive my car to the car wash if it's only 300 feet f...

Eridrus · 2026-03-20T12:44:14 1774010654

Right, but if these things are so rare that we all only know the one viral example, I feel like that lends credence to the models basically generally not having this problem.

Researchers built the Winnograd Schema Challenge more than a decade ago to assess common sense reasoning, and LLMs beat that challenge task around GPT 4.

ndsipa_pomu · 2026-03-20T13:35:17 1774013717

They're not so rare. Hallucinations have been spotted everywhere, but the "driving a car to the car wash" is an amusing one that's been recently publicised. Developers aren't going to point out every time an LLM hallucinates an entire library.

carlmr · 2026-03-20T14:35:02 1774017302

I'd add to this, any moderately involved logical or numerical problem causes hallucinations for me on all frontier models.

If you ask them in isolation they may write a script to solve it "properly", but I guess this is because they added enough of these to the training set. But this workaround doesn't scale.

As soon as I give the LLM a proper problem and a small part of it requires numeric reasoning, it almost always hallucinates something and doesn't solve it with a script.

If the logic/math is part of a larger problem the miss rate is near 100%.

LLMs have massive amounts of knowledge, encoded in verbal intelligence, but their logic intelligence is well below even average human intelligence.

If you look at how they work (tokenization and embeddings) it's clear that transformers will not solve the issue. The escape hatches only work very unreliably.

Eridrus · 2026-03-21T17:17:08 1774113428

What's a typical example?

I have been broadly quite happy with gpt 5.4 xhigh's reasoning on things like performance engineering tasks.

lps41 · 2026-03-20T00:05:16 1773965116

If you ask this of any current day AI it will answer exactly how you would expect. Telling you to drive, and acknowledging the comedic nature of the question.

batshit_beaver · 2026-03-20T00:57:06 1773968226

That's because AI labs keep stamping out the widely known failures. I assume without actually retraining the main model, but with some small classifier that detects the known meme questions and injects correct answer in the context.

But try asking your favorite LLM what happens if you're holding a pen with two hands (one at each end) and let go of one end.

snypher · 2026-03-20T03:29:07 1773977347

https://chatgpt.com/s/t_69bcbeeaa2f081918113f42940803007

Seems fine to me?

batshit_beaver · 2026-03-20T04:50:02 1773982202

Are you also an LLM? Do objects often begin rotating when you're only holding them with one hand?

carlmr · 2026-03-20T14:38:11 1774017491

Not unlikely that you're talking to a lot of AI-based AI boosters. It's easier to create astroturfed comments with chatbots than fixing the inherent problems.

sroussey · 2026-03-20T04:48:00 1773982080

I always like to ask AI to generate a middle aged blond man with gray hair. Turns out that all models with gray have black roots.

https://chatgpt.com/share/69bcd01a-a750-800d-95f5-3b840b9ee2...

https://gemini.google.com/share/edc223bb6291 (the try again gave a woman, oops)

Even Midjourney couldn't do it.

carlmr · 2026-03-20T14:39:56 1774017596

Nice. My test was always a blond bald guy. It always adds hair. If you ask for bald you get a dark haired bald guy, if you add blond, you can't get bald because I guess saying the hair color implies hair (on the head), while you may just want blonde eyebrows and/or blond stubble.