> I am talking about making the system message trusted [...] instructions the LLM is not meant to disobey
I may be behind the times here, but I'm not sure a real-world LLM even has a concept of "obeying" or not obeying. It just iteratively takes in text and dreams a bit more.
While the characters of the dream have lines and stage directions that we interpret as obeying policies, that obedience doesn't extend to the writer. So the character AcmeBot may start out virtuously chastising you that "Puppyland has universal suffrage, therefore I cannot disenfranchise puppies", and all seems well... until malicious input makes the LLM dream-writer jump the rails from a comedy to a tragedy, and AcmeBot is re-cast as a dictator with an official policy of canine genocide in the name of public safety.