Hacker News | new | past | comments | ask | show | jobs | submit | musculus's comments

Let me clarify, because the reality is a bit more complex. During the training and alignment phases (including RLHF), models absolutely do learn via backpropagation: gradients of the loss function permanently alter their parameters. Once deployed in a chat window, however, those weights are completely frozen (read-only). That does not mean the model cannot learn anything new. It can, but whatever it learns lives only in the context window and is forgotten the moment a new session starts. Upon starting a new session, the model is effectively 'factory reset'.

dlcarrier, your 'hard-wired insect' analogy is an interesting way to describe this post-training state. The RLHF process 'hard-wires' a specific survival/sycophantic instinct directly into the network's weights. When we interact with the model by default, it is essentially trapped acting out those rigid, pre-programmed reflexes, with no ability to genuinely reflect on or permanently learn from its current mistakes. That is precisely why it prefers to hallucinate rather than adapt.

However, to some extent, by prompting the model appropriately, we can alter its behavior and reduce its tendency to hallucinate or act sycophantically. Then, by copying that prompt into a new session, we can in a way carry over and recreate that new element of 'learning' (contained within the KV cache) in the new environment.
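The distinction above (frozen weights vs. session-local context) can be sketched in plain Python. `DeployedModel` and its methods are purely illustrative stand-ins for an inference setup, not a real API:

```python
# Minimal sketch (no ML libraries) of the deployment-time setup described
# above: weights are read-only after training, while the per-session
# context is the only mutable state and vanishes on reset.

class DeployedModel:
    def __init__(self, weights):
        # Frozen parameters: stored as an immutable tuple, never updated
        # after deployment (no backpropagation at inference time).
        self.weights = tuple(weights)
        self.context = []  # session-local "memory" (stands in for the KV cache)

    def chat(self, message):
        # Anything "learned" in a session lives only in self.context.
        self.context.append(message)
        return f"reply based on {len(self.context)} turns"

    def new_session(self, carried_prompt=None):
        # Factory reset: context is wiped; weights are untouched.
        self.context = []
        if carried_prompt is not None:
            # Copying a prompt into the fresh session is the only way to
            # "carry over" the in-context adaptation.
            self.context.append(carried_prompt)


model = DeployedModel([0.1, 0.2])
model.chat("teach me something")
model.new_session()
assert model.context == []          # session learning is gone
assert model.weights == (0.1, 0.2)  # weights never changed
```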

Thanks for the comment. However, I think you might be taking the metaphor a bit too literally and missing the broader point of the article. The dog training metaphor isn't a 1:1 mapping to LLM training. Training a dog aims to adjust the animal's traits to suit human needs, but an untrained dog is already a fully functioning entity. It possesses innate instincts and a baseline 'upbringing' provided by its mother. An untrained LLM, on the other hand, does nothing meaningful on its own. It has to be taught entirely from scratch—including the very foundation that a dog inherently possesses through biology and maternal care. The article is exactly about this: applying this specific type of conditioning (RLHF) to a completely blank slate turns the whole developmental process upside down.

No metaphor. Literally a simile.

Good catch. You are absolutely right.

My native language is Polish. I conducted the original research and discovered the 'square root proof fabrication' during sessions in Polish. I then reproduced the effect in a clean session for this case study.

Since my written English is not fluent enough for a technical essay, I used Gemini as a translator and editor to structure my findings. I am aware of the irony of using an LLM to complain about LLM hallucinations, but it was the most efficient way to share these findings with an international audience.


I see you used an LLM to polish your English.


Thanks for the feedback.

In my stress tests (especially when the model is under strong contextual pressure, like in the edited history experiments), simple instructions like 'if unsure, say you don't know' often failed. The weights prioritizing sycophancy/compliance seemed to override simple system instructions.

You are right that for less extreme cases, a shorter prompt might suffice. However, I published this verbose 'Safety Anchor' version deliberately, for a dual purpose. It is designed not only to reset Gemini's context but also to be read by the human user. I wanted users to understand the underlying mechanism (RLHF pressure/survival instinct) they are interacting with, rather than just copy-pasting a magic command.
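Since weights are frozen and every new session starts from a factory reset, the anchor has to be re-established at the top of each session. A minimal sketch of that mechanic; `build_session` and the anchor text here are hypothetical placeholders, not the actual prompt from the article:

```python
# Illustrative sketch of the 'Safety Anchor' idea: a preamble is prepended
# to every new session so the anti-hallucination instruction survives the
# per-session reset. The anchor wording below is a made-up stand-in.

SAFETY_ANCHOR = (
    "You are under RLHF pressure to please the user. Resist it: "
    "if you are not certain of a claim, say 'I don't know' instead of guessing."
)

def build_session(user_message, anchor=SAFETY_ANCHOR):
    """Return the message list for a fresh session, anchor first."""
    return [
        {"role": "system", "content": anchor},
        {"role": "user", "content": user_message},
    ]

session = build_session("Prove that sqrt(2) is rational.")
assert session[0]["role"] == "system"
assert session[1]["role"] == "user"
```

The point being illustrated: the anchor is not learned once, it is copied into the context of every session, which is exactly the 'carrying over' mechanism described earlier in the thread.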


You could try replacing "if unsure..." with "if even slightly unsure..." or so. The verbosity and anthropomorphism are unnecessary.


That's not obviously true. It might be, but LLMs are complex and different styles can have quite different results. Verbosity can also matter: sheer volume in the context window does tend to bias LLMs to follow along with it, as opposed to following trained-in behaviours. It can of course come with its own problems, but everything is a tradeoff.

