Another thread finally made it click for me. I hadn't looked at the mechanics of why the current method is used, which seems to be that a ton of synthetic training data is added so the model learns to follow in-band instructions.
And that's precisely the catch: there is no out-of-band stream to a language model. It's only ever completing a single channel of text/tokens, so instructions and data arrive mixed together. So, yeah, I think I get it now.
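To make that concrete, here's a rough sketch of what the model actually sees. A chat template just concatenates everything into one flat string before tokenization; the sketch below uses a ChatML-style template (the exact special tokens vary by model, so treat this as illustrative, not any particular vendor's format). The role markers are just more tokens in the same stream, which is why text pasted into a user turn can masquerade as instructions:

```python
# Minimal sketch: chat "roles" flattened into one token stream.
# ChatML-style template; exact special tokens differ per model.

def render_chatml(messages: list[dict]) -> str:
    """Serialize role-tagged messages into a single flat string.

    The role markers are ordinary tokens sitting in the same
    stream as everything else -- there is no separate control
    channel, only the model's learned tendency (from synthetic
    instruction-tuning data) to treat some spans as authoritative.
    """
    out = []
    for msg in messages:
        out.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    out.append("<|im_start|>assistant\n")  # the model completes from here
    return "".join(out)

prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this page: ...untrusted text here..."},
])
print(prompt)
```

Nothing stops the "untrusted text" from containing its own `<|im_start|>system` lookalike or plain-English instructions; the only thing separating instructions from data is training, not the architecture.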