If LLMs cannot reliably deal with this, how can they write reliable code? Following an instruction like "don't do X" is more basic than the logic of fizzbuzz.
This reminds me of the query "shirt without stripes" on any online image/product search.
Obligatory reminder that we used to live in a world where you could put "foo -bar" into a search engine, ctrl-F for foo on the top ten results and find it every time, and ctrl-F for bar on the top ten results and not find it.
Yeah, I've even had cases where DDG ignored my quoted string in the search. Exact matching is literally the whole point of the quotes, but especially when the string contains things like German umlauts, it'll just accept any replacement letter for them. And yes, getting no results is acceptable; in fact, when nothing matches, it is the only correct outcome.
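For what it's worth, the behaviour described above is easy to state precisely. Here is a minimal Python sketch, assuming plain substring matching (the ctrl-F test) over an in-memory list of strings. The names parse_query, matches and search are made up for illustration, and this says nothing about how any real search engine is implemented.

```python
import re
import unicodedata

# A token is an optional leading "-" followed by a "quoted phrase" or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]+)"|(\S+))')


def _norm(s):
    # Normalise composed/decomposed forms and case, but do NOT fold umlauts:
    # "ü" stays "ü" and never becomes "u" or "ue".
    return unicodedata.normalize("NFC", s).lower()


def parse_query(query):
    """Split a query into required terms and excluded terms."""
    required, excluded = [], []
    for minus, phrase, word in TOKEN.findall(query):
        term = phrase or word
        (excluded if minus else required).append(term)
    return required, excluded


def matches(text, required, excluded):
    """True if every required term occurs in text and no excluded term does."""
    haystack = _norm(text)
    return (all(_norm(t) in haystack for t in required)
            and not any(_norm(t) in haystack for t in excluded))


def search(query, documents):
    """Return only the documents that satisfy the query; possibly none."""
    required, excluded = parse_query(query)
    return [d for d in documents if matches(d, required, excluded)]


docs = ["foo and bar", "just foo", "Muller GmbH", "Müller GmbH",
        "striped shirt", "plain shirt"]
print(search("foo -bar", docs))       # only the document without "bar"
print(search('"Müller"', docs))       # the umlaut is matched literally
print(search('"Mueller"', docs))      # nothing matches, so nothing is returned
```

The point of the last call is that an empty list is a valid answer: the function never widens the query to have something to show.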
My eureka moment when I first started using Cursor a few weeks back was realising that I was talking to it the same way I talk to my three year old, and the results were fairly good (less so from my boy at times).
Yeah, it's also kind of funny watching people discover all the LLM failure modes and say "see! humans would never do that! it's not really intelligent!". None of those people have children...
They don't. At least not for the duration that LLMs keep it up. They really don't.
If you want to pretend that being a 3 year old is not a transient state, and that controlling an AI is just like parenting an eternal 3 year old, there's probably a manga about that.
Maybe because none of those people are imagining children to be eternally stuck at that level of intelligence. At that age (regardless of being a parent or not) you can literally see them getting smarter over the course of weeks or months.
I have also had this happen, but only when my context is getting too long, at which point models stop reading my instructions. Or if there have been too many back and forths, this can happen as well.
There is a steady decline in models' capabilities across the board as their contexts get longer. Wiping the slate clean regularly really helps to counteract this, but it can become a real pain to rebuild the context from scratch over and over. Unfortunately, I don't know of any other way to keep the models from getting really dumb over time.
I was converting all the views in my Rails app from HAML to ERB. It was doing each one perfectly, so I told it to do the rest. It went through a few, then asked me if it could write a program, and run that. I thought, hey, cool, sure. I get it; it was trying to save tokens. Clever! However -- you know where this is going -- despite knowing all the rules, and demonstrating it could apply them, the program it wrote made a total dog's breakfast out of the rest of the files. Thankfully, I've learned to commit my working copy before big "AI" changes, and I just revert when it barfs. I forced Claude to do the rest "manually" at great token expense, but it did it correctly. I've asked it to write other scripts, which it has also mangled. So I haven't been impressed at Claude's "tool writing" capability yet, and I'm jealous of people who seem to have good luck.
I say capture logs without overriding console methods -> they override console methods.
YOU ARE NOT ALLOWED TO CHANGE THE TESTS -> test changed
Or they insert various sleep calls into a test to work around race conditions.
This is all from Claude Sonnet 4.
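On the sleep calls: the usual alternative is to wait on an explicit signal with a timeout instead of sleeping for a guessed duration. A minimal Python sketch, assuming the code under test can expose a completion signal; Worker and its done event are hypothetical and not taken from any of the projects mentioned above.

```python
import threading


class Worker:
    """Hypothetical background job that signals when its result is ready."""

    def __init__(self):
        self.result = None
        self.done = threading.Event()

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        self.result = 42   # stand-in for the real work
        self.done.set()    # publish completion explicitly


def test_worker_finishes():
    worker = Worker()
    worker.start()
    # No time.sleep() here: wait() returns as soon as the event is set,
    # and returns False if the timeout expires first.
    assert worker.done.wait(timeout=5), "worker did not finish in time"
    assert worker.result == 42


test_worker_finishes()
```

The timeout is only an upper bound for the failure case, so the test does not get slower when the machine is fast and does not get flaky when it is slow.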