It's not a Markov chain because it doesn't obey the Markov property.
What it is, and what I assume you mean, is a next-word prediction model based solely on the previous sequence of words, up to some limit. It literally is that, because it was designed to be that.
Okay, so we have a function that converts a buffer of tokens into a distribution over next tokens. We give the function a buffer, get back the next-token distribution, sample from it to get our next token, and append that to the buffer (possibly rolling off the first token) to obtain a new buffer. If we consider entire buffers to be states (which is a perfectly reasonable thing to do), then we have a stochastic process that moves us from one state to another, where each transition depends only on the current state. How is that not a Markov chain?
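To make the framing concrete, here's a minimal sketch of that loop. The `next_token_dist` function is a hypothetical stand-in for the model (a toy deterministic function of the buffer, not a real LM); the vocabulary, context size, and all names are invented for illustration. The point is structural: the transition depends only on the current buffer, never on how that buffer was produced.

```python
import random

CONTEXT = 3  # toy context window ("up to some limit")
VOCAB = ["the", "cat", "sat", "mat"]

def next_token_dist(buffer):
    # Stand-in for the model: maps the current buffer (the state) to a
    # distribution over next tokens. It is a pure function of `buffer`,
    # with no memory of earlier states -- the Markov property.
    seed = sum((i + 1) * ord(c) for i, tok in enumerate(buffer) for c in tok)
    rng = random.Random(seed)
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def step(buffer, rng):
    # One transition of the chain: sample a next token from the current
    # state's distribution, append it, roll off the oldest token if the
    # buffer exceeds the context window.
    dist = next_token_dist(buffer)
    token = rng.choices(list(dist), weights=list(dist.values()))[0]
    return (buffer + [token])[-CONTEXT:]

# Run the chain for a few transitions: each new state is the old state
# plus one sampled token (truncated to the context window).
rng = random.Random(0)
state = ["the"]
for _ in range(5):
    state = step(state, rng)
```

Note that successive states differ by exactly one appended token (plus a possible truncation at the front), so the state space is huge but each transition is tightly constrained.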
That's fair, and yes, it is a Markov chain if you frame it that way, albeit one with the interesting properties that the state space is very high-dimensional and that the current and previous states never differ by more than one token.
What about a language model requires knowledge of previous states to choose the next state based on the current state? Are you asserting that only the last word chosen is the current state? Couldn’t the list of symbols chosen so far be the current state, so long as no data is carried forward about how that list has been generated?