Hacker News | gurachek's comments

I had exactly this between two LLMs in my project. An evaluator model that was supposed to grade a coaching model's work. Except it could see the coach's notes, so it just... agreed with everything. Coach says "user improved on conciseness", next answer is shorter, evaluator says yep great progress. The answer was shorter because the question was easier lol.

I only caught it because I looked at the actual score numbers after like two weeks of thinking everything was fine. Scores were completely flat the whole time. The fix was dumb and obvious: just don't let the evaluator see anything the coach wrote, only raw scores. It immediately started flagging stuff that wasn't working. Kinda wild that the default behavior for LLMs is to just validate whatever context they're given.
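For what it's worth, the fix was literally just filtering what reaches the evaluator's prompt. A minimal sketch of the idea (names and prompt shape are made up, not my actual code):

```python
# Hypothetical sketch of the "blind evaluator" fix: the evaluator
# only ever sees raw scores, never the coach's own commentary.

def build_evaluator_prompt(session_history):
    # session_history: list of dicts like
    # {"question": ..., "answer": ..., "score": ..., "coach_notes": ...}
    lines = []
    for i, turn in enumerate(session_history, 1):
        # Deliberately drop "coach_notes" - passing it through is what
        # made the evaluator rubber-stamp whatever the coach claimed.
        lines.append(f"Turn {i}: score={turn['score']}")
    lines.append("Are these scores actually improving? Flag flat or declining trends.")
    return "\n".join(lines)
```

Same model, same rubric; the only change is what it's allowed to see.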


This is probably why these models can't say "I don't know". If they could, then that would be the only response they would give for everything.


Yeah, I think so. So far, Claude Opus is the only model I've found that doesn't fold under minimal pressure and can push back, but still - push just a little harder and it's back to "appear productive and useful to the user". I don't even know how you'd balance that in LLMs and keep the business alive :D


The examples in the article are all big scary wipes, but I think the more common damage is way smaller and harder to notice.

I've been using Claude Code daily for months and the worst thing that happened wasn't a wipe (yet). It needed to save an SVG file, so it created a /public/blog/ folder. Which meant Apache started serving that real directory instead of routing /blog. My blog just 404'd and I spent like an hour debugging before I figured it out. Nothing got deleted and it's not a permission problem; the agent just put a file in a place that made sense to it.
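For anyone wondering how a new folder can 404 a route: the usual front-controller rewrite explicitly lets real files and directories win before anything reaches the router. Roughly (the stock Laravel-style .htaccess, shown as illustration, not my exact config):

```apache
RewriteEngine On
# Only fall through to the front controller when the request
# doesn't match a real directory or file on disk.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^ index.php [L]
```

So once /public/blog/ existed on disk, the !-d condition failed and Apache handled the request itself instead of passing /blog to the app.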

jai would help with the rm -rf cases for sure, but this kind of thing is harder to catch because it's not a permissions problem; the agent just doesn't know what a web server is.


The compounding booboos bit is the key insight here. Humans are a bottleneck and that bottleneck is actually load-bearing. You feel the pain of bad decisions slowly enough to course correct.

I've been building the same AI product for months - a coaching loop that persists across sessions. Every few weeks someone ships a "competitor" in a weekend. Feature list looks similar. The difference is everything that breaks when a real user comes back for session 3 or 4. Context drifts, scores stop calibrating, plans don't adapt. None of that shows up in a demo. You only find it after sitting in the same codebase for weeks, running real sessions, getting confused by your own data. That's the friction the post is talking about and I don't think you can skip it.


I like the framing of "context drift". It describes the problem in LLM/agent terms.

Similar to how "tech debt" describes the same mechanism in business terms.


The float comparison slider is great.

One thing from practical experience - the quality gap between model sizes shows up in a way benchmarks don't capture. I have a system where a smaller model generates plans and a larger model can override them. On any single output they look comparable. The difference shows up 3-4 steps later — small model makes a decision that sounds reasonable but compounds into a bad plan. Perplexity won't catch that, KL divergence won't either. They both measure one prediction at a time.


The union rep gets it - people improvise when you cut their tools and then threaten discipline for improvising.

That memo is how you make staff hide things instead of asking for help.

The scarier part though is that LLM-written clinical notes probably look fine. That's the whole problem. I built a system where one AI was scoring another AI's work, and it kept giving high marks because the output read well. I had to make the scorer blind to the original coaching text before it started catching real issues. Now imagine that "reads well, isn't right" failure mode in clinical documentation.

Nobody's re-reading the phrasing until a patient outcome goes wrong.


Physicians need to have it pounded into them that every hallucination is downstream harm. AI has no place in medicine. If they insist on it, then all transcripts must be stored with the raw audio, accessible side by side, with each line of the transcript time-coded. It's the only way to use these safely while guarding against hallucinations.


Raw audio is a cool idea! I've seen a similar approach in other domains, "keep the source of truth accessible so you can verify the AI output against it".

I wouldn't go as far as "no place in medicine" though. The Heidi scribe tool mentioned in the article is a good example, because in the end it's the doctor who reviews and signs off.

IMO the problem is AI doing the work with no human verification step, but I 100% agree I don't want a vibe-doctor for my next surgery/consult :D


> Physicians need to have it pounded into them that every hallucination is downstream harm.

I think any person using 'AI' knows it makes mistakes. Medical notes often contain errors as it is. A consumer of a medical note has to decide what makes sense and what to ignore, and AI isn't meaningfully changing that. If something matters, it's asked again in follow-up.


> I think any person using 'AI' knows it makes mistakes.

You think wrong. I’m now encountering people on a regular basis arguing “those days are behind us” and it’s “old news.”


ASR models can output a confidence score along with the text, but it's rarely surfaced in the UI when displaying results, or it's lost entirely in a subsequent LLM layer.
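Surfacing it doesn't take much. A generic sketch (the confidence field name and shape vary by API; this assumes word-level results like most ASR services can emit):

```python
# Hypothetical sketch: flag low-confidence ASR words instead of
# silently discarding the scores. Assumes word-level results of the
# form {"word": ..., "confidence": 0.0-1.0}.

def mark_uncertain(words, threshold=0.6):
    out = []
    for w in words:
        if w["confidence"] < threshold:
            # Mark shaky words so a reviewer's eye lands on them
            out.append(f"[{w['word']}?]")
        else:
            out.append(w["word"])
    return " ".join(out)
```

Even this crude version would tell a doctor which words to check against the audio.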


Your "no compiler" rule on day 3 taught you more than the LLM did. The LLM made concepts click. But the binary search vanishing under interview stress proves that understanding something and being able to produce it under pressure are totally different skills. Nobody talks about this enough in the "just use ChatGPT to learn" discourse.


There is this famous quote from Bentley on asking programmers to write binary search

>I’ve assigned this problem [binary search] in courses at Bell Labs and IBM. Professional programmers had a couple of hours to convert the above description into a program in the language of their choice; a high-level pseudocode was fine. At the end of the specified time, almost all the programmers reported that they had correct code for the task. We would then take thirty minutes to examine their code, which the programmers did with test cases. In several classes and with over a hundred programmers, the results varied little: ninety percent of the programmers found bugs in their programs (and I wasn’t always convinced of the correctness of the code in which no bugs were found).

>I was amazed: given ample time, only about ten percent of professional programmers were able to get this small program right. But they aren’t the only ones to find this task difficult: in the history in Section 6.2.1 of his Sorting and Searching, Knuth points out that while the first binary search was published in 1946, the first published binary search without bugs did not appear until 1962.

The invariants are "tricky", not necessarily hard but also not trivial to where you can convert your intuitive understanding back into code "with your eyes closed". Especially since most implementations you write will only be "subtly flawed" rather than outright broken. Randomizing an array is also one of the algorithms in this class, conceptually easy but most implementations will be "almost right", not actually generating all permutations.
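To make the "tricky invariants" point concrete, here's one correct variant of binary search (a sketch, using the half-open interval convention; most of the classic bugs come from mixing up whether hi is inclusive):

```python
def binary_search(xs, target):
    # Invariant: if target is in xs, its index lies in [lo, hi).
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = lo + (hi - lo) // 2  # avoids overflow in fixed-width languages
        if xs[mid] < target:
            lo = mid + 1           # target can't be at mid or before it
        else:
            hi = mid               # target is at mid or before it
    return lo if lo < len(xs) and xs[lo] == target else -1
```

The subtle failure modes Bentley saw are exactly the ones this invariant rules out: off-by-one on hi, mid rounding that never makes progress, and the empty-array case.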


You are 100% right. For me, the most important thing is that the LLM teacher allowed me to break through my algorithmic ignorance in just one week.

The rest is somehow orthogonal to the LLM and is just pure practice. It is very easy to procrastinate with an LLM without actual practice.

It allowed me to actually see the problem space and something like the "beauty of classical algorithms". It shifted my "unknown unknowns" into "known unknowns". I had failed so many times to achieve exactly that without an LLM in the past.


Yeah, LLMs are the perfect procrastination tool because they feel productive. You're "learning", you're "exploring", you're having this great conversation about the problem. And then you close the tab and realize you never actually wrote anything yourself.

The best procrastination device ever built because it validates you the entire time. Great post, even beyond the algorithms example.


I wrote this after going through Fabric HQ's data. The 61% stat surprised me - the cheating actually works, at least for now. Happy to discuss the implications.


I built this because it's easier for me to treat the interview & job search as a math game, and math needs numbers (lol?). The original component works with voice, because engineers' delivery collapses under interview pressure and it's hard to practice with just text.

The tool scores your answer on Structure, Completeness, Clarity, and Conciseness (0-10 each), then gives you one specific fix. No signup required.

Built with Laravel + Vue + Claude Sonnet 4.6. The scoring rubric is visible on the page + OG image.

Looking for feedback on the scoring calibration especially. Does it feel accurate to your experience?


Forgot to include: answers not stored.

Here is the DB schema (content_hash is there to save money on repeated requests with the same answer):

| #  | column_name  | data_type                                        |
|----|--------------|--------------------------------------------------|
| 1  | id           | bigint(20) unsigned                              |
| 2  | job_id       | char(36)                                         |
| 3  | ip_address   | varchar(45)                                      |
| 4  | user_id      | bigint(20) unsigned                              |
| 5  | email        | varchar(255)                                     |
| 6  | question     | text                                             |
| 7  | answer       | text                                             |
| 8  | content_hash | varchar(64)                                      |
| 9  | status       | enum('pending','processing','complete','failed') |
| 10 | result       | longtext                                         |
| 11 | scores_count | int(10) unsigned                                 |
| 12 | created_at   | timestamp                                        |
| 13 | updated_at   | timestamp                                        |
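The dedup itself is just hashing the (question, answer) pair before hitting the model. A rough sketch of what I mean (hypothetical names, not the actual code):

```python
import hashlib

def content_hash(question, answer):
    # Normalize lightly so trivial whitespace changes still hit the cache.
    payload = f"{question.strip()}\n{answer.strip()}".encode("utf-8")
    # sha256 hexdigest is 64 chars, which is why the column is varchar(64).
    return hashlib.sha256(payload).hexdigest()
```

Look up the hash first; only call the LLM on a miss.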


Hey guys!

It's a concept at the moment. But still, what would you think of such a tool if it were real?

Should I invest more time building such stuff?

