Okay, so we have a function that converts a buffer of tokens to a distribution of next tokens. We give the function a buffer, get back the next-token-distribution, from which we sample to get our next token. We append that to the buffer (and possibly roll off the first token) to obtain a new buffer. If we consider entire buffers to be states (which is a perfectly reasonable thing to do), then we have a stochastic process that moves us from one state to another. How is that not a Markov chain?
That's fair, and yes, it is a Markov chain if you frame it that way, albeit with the interesting property that the state is very high dimensional, and that consecutive states never differ by more than one token.
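The process described above can be sketched in a few lines. This is a toy illustration, not a real language model: `next_token_distribution` is a hypothetical stand-in for the model, and the fixed-length buffer is the Markov state, so each transition appends one sampled token and rolls off the oldest.

```python
import random

VOCAB = ["a", "b", "c"]
CONTEXT = 4  # fixed buffer length; the buffer itself is the chain's state


def next_token_distribution(buffer):
    # Hypothetical stand-in for the model: returns a probability
    # distribution over VOCAB conditioned on the current buffer.
    # Here it just slightly favours repeating the last token.
    weights = [1.0] * len(VOCAB)
    if buffer:
        weights[VOCAB.index(buffer[-1])] += 1.0
    total = sum(weights)
    return [w / total for w in weights]


def step(buffer):
    # One transition of the chain: sample a next token from the
    # distribution, append it, and roll off the first token so the
    # state keeps a fixed length.
    probs = next_token_distribution(buffer)
    token = random.choices(VOCAB, weights=probs)[0]
    return (buffer + [token])[-CONTEXT:]


state = ["a", "b", "c", "a"]
for _ in range(5):
    new_state = step(state)
    # Consecutive states overlap in all but one position: the new
    # state is the old state shifted left by one token.
    assert new_state[:-1] == state[1:]
    state = new_state
```

Note that the next state depends only on the current buffer, not on any earlier history, which is exactly the Markov property; the catch is that the state space has size `|VOCAB| ** CONTEXT`, which for a real model's vocabulary and context length is astronomically large.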