It's not a Markov chain because it doesn't obey the Markov property.
What it is, and what I assume you mean, is a next-word prediction model based solely on the previous sequence of words, up to some limit. It literally is that, because it was designed to be that.
Okay, so we have a function that converts a buffer of tokens into a distribution over next tokens. We give the function a buffer, get back the next-token distribution, sample from it to get our next token, and append that to the buffer (possibly rolling off the first token) to obtain a new buffer. If we consider entire buffers to be states (which is a perfectly reasonable thing to do), then we have a stochastic process that moves us from one state to another, where each transition depends only on the current state. How is that not a Markov chain?
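To make the framing concrete, here's a minimal sketch of that loop. The `next_token_dist` function is a hypothetical stand-in for the model (a toy deterministic function of the buffer, not a real LM); the vocabulary, context size, and all names are invented for illustration. The point is structural: the transition depends only on the current buffer, never on how that buffer was produced.

```python
import random

CONTEXT = 3  # toy context window ("up to some limit")
VOCAB = ["the", "cat", "sat", "mat"]

def next_token_dist(buffer):
    # Stand-in for the model: maps the current buffer (the state) to a
    # distribution over next tokens. It is a pure function of `buffer`,
    # with no memory of earlier states -- the Markov property.
    seed = sum((i + 1) * ord(c) for i, tok in enumerate(buffer) for c in tok)
    rng = random.Random(seed)
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def step(buffer, rng):
    # One transition of the chain: sample a next token from the current
    # state's distribution, append it, roll off the oldest token if the
    # buffer exceeds the context window.
    dist = next_token_dist(buffer)
    token = rng.choices(list(dist), weights=list(dist.values()))[0]
    return (buffer + [token])[-CONTEXT:]

# Run the chain for a few transitions: each new state is the old state
# plus one sampled token (truncated to the context window).
rng = random.Random(0)
state = ["the"]
for _ in range(5):
    state = step(state, rng)
```

Note that successive states differ by exactly one appended token (plus a possible truncation at the front), so the state space is huge but each transition is tightly constrained.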
That's fair, and yes, it is a Markov chain if you frame it that way, albeit one with the interesting properties that the state space is very high-dimensional and that the current and previous states never differ by more than one token.
What about a language model requires knowledge of previous states to choose the next state based on the current state? Are you asserting that only the last word chosen is the current state? Couldn’t the list of symbols chosen so far be the current state, so long as no data is carried forward about how that list has been generated?