Exactly. I think of them as Markov chains in grammar space or in Abstract Syntax Tree space instead of n-gram chain-of-words space. The attention mechanism likely plays a role in identifying the parent in the grammar tree, or in identifying other kinds of back references: pronouns in natural language, or variable references in programming languages.
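As a toy illustration of "Markov chain in grammar space": a bigram chain over part-of-speech tags rather than over words (the tag sequences here are invented):

    # Toy bigram Markov chain over part-of-speech tags instead of words.
    from collections import Counter, defaultdict
    import random

    tagged_corpus = [  # invented tag sequences, for illustration only
        ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
        ["PRON", "VERB", "DET", "NOUN"],
        ["DET", "ADJ", "NOUN", "VERB", "ADV"],
    ]

    transitions = defaultdict(Counter)
    for sent in tagged_corpus:
        for prev, cur in zip(sent, sent[1:]):
            transitions[prev][cur] += 1

    def sample_next(tag):
        # Sample the next tag in proportion to observed transition counts.
        counter = transitions[tag]
        if not counter:  # tag only ever seen sentence-finally
            return None
        tags, counts = zip(*counter.items())
        return random.choices(tags, weights=counts)[0]

    state, chain = "DET", ["DET"]
    for _ in range(6):
        state = sample_next(state)
        if state is None:
            break
        chain.append(state)
    print(chain)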
There was a time when people would estimate n-gram probabilities with feed-forward neural networks [1,2]. We just improved on that with the (multilayer) attention mechanism, which allows for better factoring over individual tokens. It also allows for much larger n.
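That classic feed-forward setup: embed the previous n-1 tokens, concatenate, and push through an MLP to get next-token logits. A minimal PyTorch sketch (all sizes invented):

    import torch
    import torch.nn as nn

    class FeedForwardNGramLM(nn.Module):
        """Predict the next token from the previous n-1 tokens."""
        def __init__(self, vocab_size, n=4, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.mlp = nn.Sequential(
                nn.Linear((n - 1) * emb_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, vocab_size),  # logits over vocabulary
            )

        def forward(self, ctx):            # ctx: (batch, n-1) token ids
            e = self.emb(ctx)              # (batch, n-1, emb_dim)
            return self.mlp(e.flatten(1))  # concatenate context, then MLP

    model = FeedForwardNGramLM(vocab_size=1000)
    logits = model(torch.randint(0, 1000, (8, 3)))  # 8 contexts of n-1=3 tokens
    print(logits.shape)                             # torch.Size([8, 1000])

The context here is hard-wired to n-1 tokens; attention is what lifted that restriction.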
Pedantry: are you using "finite state machine" as a stand-in for "abstract machine"? If so, OK. Otherwise, if you mean it in the sense of a Finite State Automaton (FSA): digital computers of the kind we all use nowadays are Turing machines (with finite tapes). If digital computers were FSAs, there would be a good many things you wouldn't be able to do with them, e.g. sorting a list of arbitrary length (because that's a context-sensitive task).
The usual. A computer is a computer, with memory cells, a CPU and so on, or a Turing machine for certain purposes. A deep neural network is a deep neural network with a certain architecture (which at least lets us reason about training), or a universal approximator (in the same limited way that a computer is a Turing machine), or a function (but that's too general).
And, well, what would a human do when they encounter the problem of recognizing an arbitrarily complex grammar? Would they try to learn it? Would they succeed? Or would they write a program to recognize it? Is writing a program in TC0?
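For a concrete version of "write a program to recognize it": a^n b^n c^n is the textbook context-sensitive example, and its recognizer is a couple of lines, regardless of whether the act of producing those lines fits in TC0:

    def recognizes_anbncn(s):
        """Recognize a^n b^n c^n (n >= 0), a context-sensitive language."""
        n = len(s) // 3
        return len(s) == 3 * n and s == "a" * n + "b" * n + "c" * n

    assert recognizes_anbncn("aabbcc")
    assert not recognizes_anbncn("aabcbc")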
It might be a genuine limitation that guarantees subhuman performance at practically achievable network sizes, but it's not that straightforward.
That's actually a good point I had not thought of before... do you happen to know what this concept is called in the literature?
Indeed, even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages, that is still sufficient to write programs that can then simulate a Turing machine and give you Turing completeness in a "meta" way. (The catch might be that the program is only Turing complete in an "abstract" sense, just as real-world running programs don't have infinite tape.)
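The emitted program could be something like this textbook single-tape simulator (the machine spec is an invented toy; the tape grows on demand, so it's "Turing complete" only up to available memory, as you say):

    def run_tm(transitions, tape, state="q0", accept="qA", reject="qR"):
        """Simulate a single-tape Turing machine.

        transitions: (state, symbol) -> (new_state, write_symbol, move)
        with move in {-1, +1}. "_" is the blank symbol.
        """
        tape = dict(enumerate(tape))
        head = 0
        while state not in (accept, reject):
            symbol = tape.get(head, "_")
            state, tape[head], move = transitions[(state, symbol)]
            head += move
        return state == accept

    # Invented toy machine: accept strings of "1"s of even length.
    delta = {
        ("q0", "1"): ("q1", "1", +1),
        ("q1", "1"): ("q0", "1", +1),
        ("q0", "_"): ("qA", "_", +1),
        ("q1", "_"): ("qR", "_", +1),
    }
    print(run_tm(delta, "1111"))  # True
    print(run_tm(delta, "111"))   # False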
The term for research on whether LLMs can learn and execute this or that program-synthesis algorithm? I guess there's none. If we do know the algorithm, it's better to teach the LLM to write a program that outputs a program. If we don't know the algorithm, we can't reason about whether the LLM can do it.
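A toy example of "a program that outputs a program" (everything here is invented for illustration): instead of recognizing a^n b^n directly, emit the source of a recognizer specialized to n:

    def emit_recognizer_source(n):
        # Emits the source of a recognizer specialized to this n.
        return (
            "def recognize(s):\n"
            f"    return s == {'a' * n + 'b' * n!r}\n"
        )

    src = emit_recognizer_source(3)
    print(src)
    scope = {}
    exec(src, scope)                      # "run" the emitted program
    print(scope["recognize"]("aaabbb"))   # True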
> even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages
https://arxiv.org/abs/2310.07923 Their recognition abilities are bounded by context-sensitive languages if the number of chain-of-thought steps is linear in the length of the input.
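Operationally, "linear number of chain-of-thought steps" just means a decoding loop like this sketch, where model is a hypothetical next-token function standing in for one forward pass:

    def recognize_with_cot(model, x, accept_token="YES", reject_token="NO"):
        """Decode O(len(x)) scratchpad tokens before committing to an answer.

        `model` is a hypothetical next-token function: prompt -> token.
        Each appended token buys one extra sequential computation step on
        top of the fixed-depth parallel pass the transformer gets for free.
        """
        prompt = list(x)
        for _ in range(len(x)):                  # linear number of CoT steps
            token = model(prompt)
            if token in (accept_token, reject_token):
                return token == accept_token
            prompt.append(token)                 # scratchpad grows the "tape"
        return model(prompt) == accept_token

    # e.g. with a trivial stand-in "model" that answers immediately:
    print(recognize_with_cot(lambda prompt: "YES", "aabbcc"))  # True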
All in all, the asymptotic behavior of a network on an algorithmic task doesn't matter that much for AI. What matters is whether the network can reduce a problem to a manageable length.