
I agree that chatbots that offer APIs are most likely implemented statelessly today.

Meaning they take the last "context window" tokens from the client as input, use them to recompute the internal state, and then start generating token by token. After generation, no memory needs to be kept on the server (apart from a very small amount for the context-window tokens themselves).
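A minimal sketch of that stateless pattern, assuming a hypothetical `ChatServer` class (the names and the toy "tokenization" are illustrative, not any real API):

```python
# Hypothetical stateless chat API: the client resends the whole
# conversation every turn; the server rebuilds its state from scratch
# and stores nothing between calls.

class ChatServer:
    def generate(self, history: str) -> str:
        # Recompute internal state from the full history (stand-in for
        # running the transformer over all context-window tokens),
        # then produce a reply.
        state = self._recompute_state(history)
        return f"reply-to:{len(state)}-tokens"

    def _recompute_state(self, history: str) -> list:
        return history.split()  # toy "tokenization"


server = ChatServer()
history = "user: hello"
reply = server.generate(history)           # server holds no memory afterwards
history += f" assistant: {reply} user: how are you?"
reply2 = server.generate(history)          # full history resent, state rebuilt
```

The client, not the server, carries the conversation state between turns.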

Chatbots like llama.cpp in interactive mode don't have to recompute this internal state at every interaction, because the state is kept in memory between turns.

You can view the last "context window" characters as a compressed representation of the internal state.

This becomes more pertinent as the context window grows: the bigger the window, the more you have to recompute at each interaction.
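A toy cost model makes the difference concrete. The numbers below are illustrative only: they just count tokens processed per turn, comparing stateless recomputation against a llama.cpp-style session that keeps its state:

```python
# Toy cost model: tokens processed over a conversation,
# stateless recompute vs. a cached session.

def stateless_cost(turn_lengths):
    # Each turn reprocesses the entire history so far.
    total, history = 0, 0
    for n in turn_lengths:
        history += n
        total += history
    return total

def cached_cost(turn_lengths):
    # Cached state: only the new tokens of each turn are processed.
    return sum(turn_lengths)


turns = [100] * 10                 # ten turns of 100 tokens each
print(stateless_cost(turns))       # 100 + 200 + ... + 1000 = 5500
print(cached_cost(turns))          # 1000
```

The stateless cost grows quadratically with conversation length, the cached cost only linearly.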

The transformer architecture can also be trained differently, in an encoder-decoder setup, to produce "context vectors" of fixed size that summarize all the previous messages of the conversation. Such a context vector is easier to keep on the server; it will contain the gist of the conversation and its important points, but won't allow quoting the past verbatim. This context vector is then used to condition the generation of the reply.

Once the chatbot has replied and received a new prompt, a distinct neural network updates the fixed-size context vector to incorporate the latest exchange, and that updated vector conditions the next generation, ad infinitum.
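The update loop can be sketched as follows. Everything here is a stand-in: the "networks" are random projections and the embedding is a byte-count bag, just to show that the server-side state stays a constant-size vector no matter how long the conversation runs:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # fixed context-vector size
W_ctx = rng.standard_normal((D, 2 * D))  # hypothetical update network
W_emb = rng.standard_normal((D, 256))    # hypothetical message encoder

def embed(message: str) -> np.ndarray:
    # Toy embedding: bag of byte counts projected down to D dims.
    counts = np.bincount(np.frombuffer(message.encode(), np.uint8),
                         minlength=256)
    return W_emb @ counts

def update_context(ctx: np.ndarray, message: str) -> np.ndarray:
    # New context = f(old context, new message); size stays D forever.
    return np.tanh(W_ctx @ np.concatenate([ctx, embed(message)]))


ctx = np.zeros(D)
for msg in ["hello", "tell me about transformers", "thanks!"]:
    ctx = update_context(ctx, msg)       # server keeps only this D-vector

print(ctx.shape)                         # (16,) regardless of history length
```

In a real system the reply would be generated conditioned on `ctx`, and the assistant's own reply would be folded in the same way.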


