
Like it or not, LLMs are effectively high-order Markov chains


Exactly. I think of them as Markov chains in grammar space, or in abstract syntax tree space, instead of n-gram chain-of-words space. The attention mechanism likely plays a role in identifying the parent in the grammar tree, or other types of back references such as pronouns or, in programming languages, variable references.
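
To make the "back reference" reading concrete, here is a minimal sketch (Python/NumPy; the toy dimensions and function names are my own, not from any actual LLM implementation) of a single causal attention head viewed as a soft back-reference lookup: each token queries earlier positions and pulls in the representation of whichever "antecedent" scores highest.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def causal_attention(X, Wq, Wk, Wv):
        # Single-head attention over token embeddings X of shape (seq_len, d_model).
        # Row i of the returned weight matrix says which earlier positions token i
        # "points back" to (a pronoun's antecedent, a variable's definition, ...).
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores[mask] = -np.inf                      # causal mask: no looking ahead
        weights = softmax(scores, axis=-1)
        return weights @ V, weights

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                     # 5 tokens, 8-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    out, w = causal_attention(X, Wq, Wk, Wv)
    print(np.round(w, 2))                           # row i: attention of token i over tokens 0..i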


There was a time when people estimated n-gram probabilities with feed-forward neural networks [1,2]. We just improved on that with the (multilayer) attention mechanism, which allows for better factoring over individual tokens. It also allows for a much larger n.

[1] https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

[2] https://www.sciencedirect.com/science/article/abs/pii/S08852...
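
For reference, a rough sketch of the kind of feed-forward n-gram language model described in [1]: the embeddings of the previous n-1 tokens are concatenated and fed through one hidden layer to produce a distribution over the next token. (The sizes and names below are illustrative, and training is omitted.)

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    class NGramNeuralLM:
        # Feed-forward n-gram LM in the spirit of Bengio et al. (2003):
        # P(w_t | w_{t-n+1} .. w_{t-1}) from concatenated embeddings + one hidden layer.
        def __init__(self, vocab_size, n=4, d_emb=32, d_hidden=64, seed=0):
            rng = np.random.default_rng(seed)
            self.n = n
            self.C = rng.normal(0, 0.1, (vocab_size, d_emb))           # embedding table
            self.W1 = rng.normal(0, 0.1, ((n - 1) * d_emb, d_hidden))  # hidden layer
            self.W2 = rng.normal(0, 0.1, (d_hidden, vocab_size))       # output layer

        def next_token_probs(self, context_ids):
            assert len(context_ids) == self.n - 1
            x = self.C[context_ids].reshape(-1)    # concatenate the n-1 context embeddings
            h = np.tanh(x @ self.W1)
            return softmax(h @ self.W2)            # distribution over the next token

    lm = NGramNeuralLM(vocab_size=1000)
    print(lm.next_token_probs([3, 17, 42]).shape)  # (1000,)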


Only in the way that every real-world computer is a finite state machine.


Every real-world computer is a finite state machine; there are just a lot of states. I guess I'm unsure of the point you're trying to make here.
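
As a back-of-the-envelope illustration of "a lot of states" (the 16 GiB figure below is just an arbitrary example):

    import math

    ram_bytes = 16 * 2**30                    # say, 16 GiB of RAM
    bits = ram_bytes * 8
    digits = int(bits * math.log10(2)) + 1    # decimal digits in 2**bits
    print(f"as an FSM, roughly 2^{bits} states -- a number with ~{digits:,} digits")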


Pedantry: are you using "finite state machine" as a stand-in for "abstract machine"? If so, OK; otherwise, if you mean it in the sense of a Finite State Automaton (FSA), digital computers of the kind we all use nowadays are Turing machines (with finite tapes). If digital computers were FSAs there would be a good many things you wouldn't be able to do with them, e.g. sorting a list (because that's a context-sensitive task).


The point? It's an almost useless way of looking at things.


What is the useful way?


The usual. A computer is a computer with memory cells, a CPU and so on, or a Turing machine for certain purposes. A deep neural network is a deep neural network with a certain architecture (that at least lets you reason about training), or a universal approximator (in the same limited way that a computer is a Turing machine), or a function (but that's too general).


For neural networks, the circuit complexity model (e.g. class TC0) is probably better suited.


It's the circuit complexity of one forward pass.

And, well, what would a human do when they encounter a problem of recognizing an arbitrarily complex grammar? Would they try to learn it? Would they succeed? Or would they write a program to recognize it? Is writing a program in TC0?

It might be a genuine limitation that guarantees subhuman performance at practically achievable network sizes, but it's not that straightforward.
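
To illustrate the "write a program to recognize it" escape hatch: one forward pass performs a fixed number of sequential steps no matter how long the input is, but a program the model emits can loop for as long as the input demands. A toy sketch (balanced-bracket checking is picked purely as an example of work that scales with input length, not as a claim about its exact place in the circuit hierarchy):

    def emitted_recognizer(s):
        # The kind of program a model could *write* rather than compute in one pass:
        # its running time grows with the input, with no fixed depth bound.
        depth = 0
        for ch in s:
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth < 0:
                    return False
        return depth == 0

    print(emitted_recognizer("(()(()))"))   # True
    print(emitted_recognizer("(()"))        # False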


That's actually a good point I had not thought of before... do you happen to know what this concept is called in the literature?

Indeed, even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages, that is still sufficient to write programs that can then simulate a Turing machine and give you Turing completeness in a "meta" way. (The catch might be that the program is only Turing complete in an "abstract" sense, just as in the real world, running programs don't have infinite tape.)
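
To make the "meta" point concrete, here is a minimal Turing-machine simulator of the sort a bounded generator could emit and have run externally (the transition table is a made-up unary incrementer, just for illustration; as the parenthetical says, the tape is finite in practice and grows on demand):

    def run_turing_machine(transitions, tape, state="start", blank="_", max_steps=10_000):
        # transitions: dict mapping (state, symbol) -> (new_state, write_symbol, move),
        # where move is 'L' or 'R'. The machine halts when no transition applies.
        tape = dict(enumerate(tape))
        head = 0
        for _ in range(max_steps):
            symbol = tape.get(head, blank)
            if (state, symbol) not in transitions:
                break
            state, write, move = transitions[(state, symbol)]
            tape[head] = write
            head += 1 if move == "R" else -1
        return state, "".join(tape[i] for i in sorted(tape))

    # Example machine (made up): append a '1' to a unary number, then halt.
    transitions = {
        ("start", "1"): ("start", "1", "R"),   # skip over the existing 1s
        ("start", "_"): ("halt", "1", "R"),    # write one more 1 at the end
    }
    print(run_turing_machine(transitions, "111"))   # ('halt', '1111')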


The term for research on whether LLMs can learn and execute this or that program-synthesis algorithm? I guess there's none. If we do know the algorithm, it's better to teach the LLM to write a program that outputs a program. If we don't know the algorithm, we can't reason about whether an LLM can do it.

> even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages

https://arxiv.org/abs/2310.07923 Their recognition abilities are limited to context-sensitive languages if the number of chain-of-thought steps is linear in the length of the input.

All in all, asymptotic behavior of a network on an algorithmic task doesn't matter that much for AI. What matters is whether the network can simplify a problem to a manageable length.


Markov chains with limited self-correction



