Where the state space would be proportional to the token length squared, just like the attention mechanisms we use today?
E.g. imagine an input of red, followed by 32 bits of randomness, followed by blue forever. A Markov chain could learn that red leads to blue 32 bits later; it would just need to learn 2^32 states.
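To make that blow-up concrete, here's a toy Python sketch (my own illustration, not from the original example): it enumerates every context a fixed-order Markov model would have to memorize to predict the first blue, since the model can only "see" red by carrying all the intervening random bits in its state.

```python
from itertools import product

def contexts_needed(n_random_bits: int) -> int:
    """Enumerate every order-(n_random_bits + 1) context that can immediately
    precede the first 'blue'.  Each distinct context is a separate state a
    fixed-order Markov model has to store to make the prediction."""
    order = n_random_bits + 1  # context must reach back to the initial 'red'
    seen = set()
    for bits in product("01", repeat=n_random_bits):
        seq = ["red", *bits, "blue"]
        seen.add(tuple(seq[:order]))  # the tokens just before 'blue'
    return len(seen)

for k in (4, 8, 16):
    print(f"{k} random bits -> {contexts_needed(k):,} contexts (2^{k} = {2**k:,})")
# With 32 random bits that's 2^32 (~4.3 billion) states, just to encode that
# red implies blue; attention can instead look 33 tokens back directly.
```

The count doubles with every extra random bit, which is the whole problem: the Markov model pays exponentially in state for a dependency that attention handles by position.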
A few more leaps and we should eventually get models small enough to come close to the information-theoretic lower bounds of compression.