Exactly. I think of them as Markov chains in grammar space or in Abstract Syntax Tree space instead of n-gram chain-of-words space. The attention mechanism likely plays a role in identifying the parent in the grammar tree, or in identifying other kinds of back references: pronouns in natural language, or variable references in programming languages.
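As a toy illustration of "Markov chain in grammar space": a bigram chain over part-of-speech tags rather than over words (the tag sequences here are invented):

    # Toy bigram Markov chain over part-of-speech tags instead of words.
    from collections import Counter, defaultdict
    import random

    tagged_corpus = [  # invented tag sequences, for illustration only
        ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
        ["PRON", "VERB", "DET", "NOUN"],
        ["DET", "ADJ", "NOUN", "VERB", "ADV"],
    ]

    transitions = defaultdict(Counter)
    for sent in tagged_corpus:
        for prev, cur in zip(sent, sent[1:]):
            transitions[prev][cur] += 1

    def sample_next(tag):
        # Sample the next tag in proportion to observed transition counts.
        counter = transitions[tag]
        if not counter:  # tag only ever seen sentence-finally
            return None
        tags, counts = zip(*counter.items())
        return random.choices(tags, weights=counts)[0]

    state, chain = "DET", ["DET"]
    for _ in range(6):
        state = sample_next(state)
        if state is None:
            break
        chain.append(state)
    print(chain)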
There was a time when people would estimate n-gram probabilities with feed-forward neural networks [1,2]. We just improved on that with the (multilayer) attention mechanism, which allows for better factoring over individual tokens. It also allows for much larger n.
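That classic feed-forward setup: embed the previous n-1 tokens, concatenate, and push through an MLP to get next-token logits. A minimal PyTorch sketch (all sizes invented):

    import torch
    import torch.nn as nn

    class FeedForwardNGramLM(nn.Module):
        """Predict the next token from the previous n-1 tokens."""
        def __init__(self, vocab_size, n=4, emb_dim=64, hidden_dim=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.mlp = nn.Sequential(
                nn.Linear((n - 1) * emb_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, vocab_size),  # logits over vocabulary
            )

        def forward(self, ctx):            # ctx: (batch, n-1) token ids
            e = self.emb(ctx)              # (batch, n-1, emb_dim)
            return self.mlp(e.flatten(1))  # concatenate context, then MLP

    model = FeedForwardNGramLM(vocab_size=1000)
    logits = model(torch.randint(0, 1000, (8, 3)))  # 8 contexts of n-1=3 tokens
    print(logits.shape)                             # torch.Size([8, 1000])

The context here is hard-wired to n-1 tokens; attention is what lifted that restriction.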
Pedantry: are you using "finite state machine" as a stand-in for "abstract machine"? If so, OK. Otherwise, if you mean it in the sense of a Finite State Automaton (FSA): digital computers of the kind we all use nowadays are Turing machines (with finite tapes). If digital computers were FSAs, there would be a good many things you wouldn't be able to do with them, e.g. sorting a list of arbitrary length (because that's a context-sensitive task).
The usual. A computer is a computer, with memory cells, a CPU and so on, or a Turing machine for certain purposes. A deep neural network is a deep neural network with a certain architecture (which at least lets us reason about training), or a universal approximator (in the same limited way that a computer is a Turing machine), or a function (but that's too general).
And, well, what would a human do when they encounter the problem of recognizing an arbitrarily complex grammar? Would they try to learn it? Would they succeed? Or would they write a program to recognize it? Is writing a program in TC0?
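For a concrete version of "write a program to recognize it": a^n b^n c^n is the textbook context-sensitive example, and its recognizer is a couple of lines, regardless of whether the act of producing those lines fits in TC0:

    def recognizes_anbncn(s):
        """Recognize a^n b^n c^n (n >= 0), a context-sensitive language."""
        n = len(s) // 3
        return len(s) == 3 * n and s == "a" * n + "b" * n + "c" * n

    assert recognizes_anbncn("aabbcc")
    assert not recognizes_anbncn("aabcbc")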
It might be a genuine limitation that guarantees subhuman performance at practically achievable network sizes, but it's not that straightforward.
That's actually a good point I had not thought of before... do you happen to know what this concept is called in the literature?
Indeed, even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages, that is still sufficient to write programs that can then simulate a Turing machine and give you Turing completeness in a "meta" way. (The catch might be that the program is only Turing complete in an "abstract" sense, just as real-world running programs don't have infinite tape.)
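The emitted program could be something like this textbook single-tape simulator (the machine spec is an invented toy; the tape grows on demand, so it's "Turing complete" only up to available memory, as you say):

    def run_tm(transitions, tape, state="q0", accept="qA", reject="qR"):
        """Simulate a single-tape Turing machine.

        transitions: (state, symbol) -> (new_state, write_symbol, move)
        with move in {-1, +1}. "_" is the blank symbol.
        """
        tape = dict(enumerate(tape))
        head = 0
        while state not in (accept, reject):
            symbol = tape.get(head, "_")
            state, tape[head], move = transitions[(state, symbol)]
            head += move
        return state == accept

    # Invented toy machine: accept strings of "1"s of even length.
    delta = {
        ("q0", "1"): ("q1", "1", +1),
        ("q1", "1"): ("q0", "1", +1),
        ("q0", "_"): ("qA", "_", +1),
        ("q1", "_"): ("qR", "_", +1),
    }
    print(run_tm(delta, "1111"))  # True
    print(run_tm(delta, "111"))   # False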
The term for research on whether LLMs can learn and execute this or that program-synthesis algorithm? I guess there's none. If we do know the algorithm, it's better to teach the LLM to write a program that outputs a program. If we don't know the algorithm, we can't reason about whether the LLM can do it.
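A toy example of "a program that outputs a program" (everything here is invented for illustration): instead of recognizing a^n b^n directly, emit the source of a recognizer specialized to n:

    def emit_recognizer_source(n):
        # Emits the source of a recognizer specialized to this n.
        return (
            "def recognize(s):\n"
            f"    return s == {'a' * n + 'b' * n!r}\n"
        )

    src = emit_recognizer_source(3)
    print(src)
    scope = {}
    exec(src, scope)                      # "run" the emitted program
    print(scope["recognize"]("aaabbb"))   # True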
> even if someone discovered that LLMs can only ever emit/recognize context-sensitive languages
https://arxiv.org/abs/2310.07923 Their recognition abilities are bounded by context-sensitive languages if the number of chain-of-thought steps is linear in the length of the input.
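Operationally, "linear number of chain-of-thought steps" just means a decoding loop like this sketch, where model is a hypothetical next-token function standing in for one forward pass:

    def recognize_with_cot(model, x, accept_token="YES", reject_token="NO"):
        """Decode O(len(x)) scratchpad tokens before committing to an answer.

        `model` is a hypothetical next-token function: prompt -> token.
        Each appended token buys one extra sequential computation step on
        top of the fixed-depth parallel pass the transformer gets for free.
        """
        prompt = list(x)
        for _ in range(len(x)):                  # linear number of CoT steps
            token = model(prompt)
            if token in (accept_token, reject_token):
                return token == accept_token
            prompt.append(token)                 # scratchpad grows the "tape"
        return model(prompt) == accept_token

    # e.g. with a trivial stand-in "model" that answers immediately:
    print(recognize_with_cot(lambda prompt: "YES", "aabbcc"))  # True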
All in all, the asymptotic behavior of a network on an algorithmic task doesn't matter that much for AI. What matters is whether the network can reduce a problem to a manageable length.