I share examples of LLM fails on our company Slack and every week LLMs do the op...

carb · 2025-08-15T07:07:54 1755241674

I've found better results when I treat LLMs like you would treat little kids. Don't tell them what NOT to do, tell them what TO do.

Say "keep your hands at your side, it's hot" and not "don't touch the stove, it's hot". If you say the latter, most kids touch the stove.

alpaca128 · 2025-08-15T12:08:01 1755259681

If LLMs cannot reliably deal with this, how can they write reliable code? Following an instruction like "don't do X" is more basic than the logic of fizzbuzz.

This reminds me of the query "shirt without stripes" on any online image/product search.

zahlman · 2025-08-15T18:50:24 1755283824

Obligatory reminder that we used to live in a world where you could put "foo -bar" into a search engine, ctrl-F for foo on the top ten results and find it every time, and ctrl-F for bar on the top ten results and not find it.
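
For anyone who never used that syntax, here is a minimal sketch of the semantics being described: every plain term must appear and every "-" term must not. The `matches` function and the sample `docs` are made up for illustration and are not how any real search engine is implemented.

```python
def matches(query: str, text: str) -> bool:
    """Return True if text has every plain term and no '-'-prefixed term."""
    words = set(text.lower().split())
    for term in query.lower().split():
        if term.startswith("-") and len(term) > 1:
            # Excluded term: any occurrence disqualifies the document.
            if term[1:] in words:
                return False
        elif term not in words:
            # Required term is missing.
            return False
    return True


# Hypothetical documents, just to exercise the filter.
docs = ["foo is great", "foo and bar together", "only bar here"]
hits = [d for d in docs if matches("foo -bar", d)]
```

With these inputs only the first document survives. An empty result list is a valid answer when nothing qualifies.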

alpaca128 · 2025-08-18T09:44:32 1755510272

Yeah, I've even had cases where DDG ignored my quoted string in the search. Matching it exactly is literally the whole point of the quotes, but especially when the string contains things like German umlauts, it'll just accept any replacement letter for them. And yes, getting no results is acceptable; in fact it is the only correct outcome.

amai · 2025-08-16T18:59:02 1755370742

Negation is a hard problem for AI and mainly unsolved:

- https://seantrott.substack.com/p/llms-and-the-not-problem

- https://github.com/elsamuko/Shirt-without-Stripes

glitchcrab · 2025-08-15T11:05:29 1755255929

My eureka moment when I first started using Cursor a few weeks back was realising that I was talking to it the same way I talk to my three year old and the results were fairly good (less so from my boy at times).

IshKebab · 2025-08-15T11:42:32 1755258152

Yeah it's also kind of funny people discovering all the LLM failure modes and saying "see! humans would never do that! it's not really intelligent!". None of those people have children...

Chinjut · 2025-08-15T13:06:11 1755263171

I don't want a computer that's as unreliable as a child. This is not what originally interested me about computers.

IshKebab · 2025-08-15T18:02:18 1755280938

Nobody said you did. I'm talking about the confidently incorrect assertions that humans would never display any of these unreliable behaviours.

tripzilch · 2025-08-17T18:58:00 1755457080

They don't. At least not for the duration that LLMs keep it up. They really don't.

If you want to pretend that being a 3 year old is not a transient state, and that controlling an AI is just like parenting an eternal 3 year old, there's probably a manga about that.

jama211 · 2025-08-17T03:47:11 1755402431

Don’t be daft

tripzilch · 2025-08-17T18:53:39 1755456819

Maybe because none of those people are imagining children to be eternally stuck at that level of intelligence. At that age (regardless of being a parent or not) you can literally see them getting smarter over the course of weeks or months.

sothatsit · 2025-08-15T07:51:41 1755244301

I have also had this happen, but only when my context is getting too long, at which point models stop reading my instructions. Or if there have been too many back and forths, this can happen as well.

There is a steady decline in models' capabilities across the board as their contexts get longer. Wiping the slate clean regularly really helps to counteract this, but it can really become a pain to rebuild the context from scratch over and over. Unfortunately, I don't really know any other way to avoid the models getting really dumb over time.

maelito · 2025-08-15T08:00:18 1755244818

LLMs erasing your important comments is so irritating! Happened to me often.

toenail · 2025-08-15T12:46:40 1755262000

I simply had Claude write me a linting tool that catches its repeated bad stuff.
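
The comment does not say what the tool checks, so the following is only a guess at one possible rule, borrowing the "erased comments" complaint from elsewhere in the thread. It is a rough sketch, assuming Python-style `#` comments that sit on their own line; `dropped_comments` is a hypothetical name, not the actual tool.

```python
def dropped_comments(before: str, after: str) -> list[str]:
    """Return whole-line '#' comments found in `before` but absent from `after`."""
    def comments(src: str) -> list[str]:
        # Only lines that start with '#' after stripping whitespace are counted.
        return [ln.strip() for ln in src.splitlines() if ln.strip().startswith("#")]

    kept = set(comments(after))
    return [c for c in comments(before) if c not in kept]


# Hypothetical before/after versions of a file edited by a model.
before = "# keep this\nx = 1\n# and this\n"
after = "x = 1\n# and this\n"
missing = dropped_comments(before, after)
```

A check like this could run before committing and fail whenever `missing` is non-empty. It would also flag comments that were deliberately reworded, so it is better treated as a prompt for review than as a hard rule.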

TheRealDunkirk · 2025-08-15T13:08:47 1755263327

I was converting all the views in my Rails app from HAML to ERB. It was doing each one perfectly, so I told it to do the rest. It went through a few, then asked me if it could write a program, and run that. I thought, hey, cool, sure. I get it; it was trying to save tokens. Clever! However -- you know where this is going -- despite knowing all the rules, and demonstrating it could apply them, the program it wrote made a total dog's breakfast out of the rest of the files. Thankfully, I've learned to commit my working copy before big "AI" changes, and I just revert when it barfs. I forced Claude to do the rest "manually" at great token expense, but it did it correctly. I've asked it to write other scripts, which it has also mangled. So I haven't been impressed at Claude's "tool writing" capability yet, and I'm jealous of people who seem to have good luck.

polynomial · 2025-08-16T06:08:18 1755324498

Imagine if you had to do this with an actual team member.

paulcole · 2025-08-15T15:09:51 1755270591

> I share examples of LLM fails on our company Slack and every week LLMs do the opposite of what I tell them.

Must be fun.

iamflimflam1 · 2025-08-15T07:43:16 1755243796

Do you also share examples of when it works really well?