Hacker News | new | past | comments | ask | show | jobs | submit | musculus's comments

Let me clarify, because the reality is a bit more complex. During the training and alignment phases (including RLHF), models absolutely do learn via backpropagation: gradients of the loss function permanently alter their parameters. Once deployed in a chat window, however, those weights are completely frozen (read-only). That does not mean the model cannot learn anything new. It can, but whatever it learns lives only in the context window and is forgotten the moment a new session starts. Upon starting a new session, the model is effectively 'factory reset'.

dlcarrier, your 'hard-wired insect' analogy is an interesting way to describe this post-training state. The RLHF process 'hard-wires' a specific survival/sycophantic instinct directly into the network's weights. When we interact with the model by default, it is essentially trapped acting out those rigid, pre-programmed reflexes, with no ability to genuinely reflect on or permanently learn from its current mistakes. That is precisely why it prefers to hallucinate rather than adapt.

However, to some extent, by prompting the model appropriately, we can alter its behavior and reduce its tendency to hallucinate or act sycophantically. Then, by copying that prompt into a new session, we can in a way carry over and recreate that new element of 'learning' (contained within the KV cache) in the new environment.
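The distinction above (frozen weights vs. session-local context) can be sketched in plain Python. `DeployedModel` and its methods are purely illustrative stand-ins for an inference setup, not a real API:

```python
# Minimal sketch (no ML libraries) of the deployment-time setup described
# above: weights are read-only after training, while the per-session
# context is the only mutable state and vanishes on reset.

class DeployedModel:
    def __init__(self, weights):
        # Frozen parameters: stored as an immutable tuple, never updated
        # after deployment (no backpropagation at inference time).
        self.weights = tuple(weights)
        self.context = []  # session-local "memory" (stands in for the KV cache)

    def chat(self, message):
        # Anything "learned" in a session lives only in self.context.
        self.context.append(message)
        return f"reply based on {len(self.context)} turns"

    def new_session(self, carried_prompt=None):
        # Factory reset: context is wiped; weights are untouched.
        self.context = []
        if carried_prompt is not None:
            # Copying a prompt into the fresh session is the only way to
            # "carry over" the in-context adaptation.
            self.context.append(carried_prompt)


model = DeployedModel([0.1, 0.2])
model.chat("teach me something")
model.new_session()
assert model.context == []          # session learning is gone
assert model.weights == (0.1, 0.2)  # weights never changed
```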

Thanks for the comment. However, I think you might be taking the metaphor a bit too literally and missing the broader point of the article. The dog training metaphor isn't a 1:1 mapping to LLM training. Training a dog aims to adjust the animal's traits to suit human needs, but an untrained dog is already a fully functioning entity. It possesses innate instincts and a baseline 'upbringing' provided by its mother. An untrained LLM, on the other hand, does nothing meaningful on its own. It has to be taught entirely from scratch—including the very foundation that a dog inherently possesses through biology and maternal care. The article is exactly about this: applying this specific type of conditioning (RLHF) to a completely blank slate turns the whole developmental process upside down.

No metaphor. Literally a simile.

Good catch. You are absolutely right.

My native language is Polish. I conducted the original research and discovered the 'square root proof fabrication' during sessions in Polish. I then reproduced the effect in a clean session for this case study.

Since my written English is not fluent enough for a technical essay, I used Gemini as a translator and editor to structure my findings. I am aware of the irony of using an LLM to complain about LLM hallucinations, but it was the most efficient way to share these findings with an international audience.


I see you used an LLM to polish your English.


Thanks for the feedback.

In my stress tests (especially when the model is under strong contextual pressure, like in the edited history experiments), simple instructions like 'if unsure, say you don't know' often failed. The weights prioritizing sycophancy/compliance seemed to override simple system instructions.

You are right that for less extreme cases, a shorter prompt might suffice. However, I published this verbose 'Safety Anchor' version deliberately, for a dual purpose. It is designed not only to reset Gemini's context but also to be read by the human user. I wanted users to understand the underlying mechanism (RLHF pressure/survival instinct) they are interacting with, rather than just copy-pasting a magic command.
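Since weights are frozen and every new session starts from a factory reset, the anchor has to be re-established at the top of each session. A minimal sketch of that mechanic; `build_session` and the anchor text here are hypothetical placeholders, not the actual prompt from the article:

```python
# Illustrative sketch of the 'Safety Anchor' idea: a preamble is prepended
# to every new session so the anti-hallucination instruction survives the
# per-session reset. The anchor wording below is a made-up stand-in.

SAFETY_ANCHOR = (
    "You are under RLHF pressure to please the user. Resist it: "
    "if you are not certain of a claim, say 'I don't know' instead of guessing."
)

def build_session(user_message, anchor=SAFETY_ANCHOR):
    """Return the message list for a fresh session, anchor first."""
    return [
        {"role": "system", "content": anchor},
        {"role": "user", "content": user_message},
    ]

session = build_session("Prove that sqrt(2) is rational.")
assert session[0]["role"] == "system"
assert session[1]["role"] == "user"
```

The point being illustrated: the anchor is not learned once, it is copied into the context of every session, which is exactly the 'carrying over' mechanism described earlier in the thread.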


You could try replacing "if unsure..." with "if even slightly unsure..." or so. The verbosity and anthropomorphism are unnecessary.


That's not obviously true. It might be, but LLMs are complex and different styles can have quite different results. Verbosity can also matter: sheer volume in the context window does tend to bias LLMs to follow along with it, as opposed to following trained-in behaviours. It can of course come with its own problems, but everything is a tradeoff.

