It depends on how large the input prompt (previous context) is. Also, if you can keep the cache on the GPU with an LRU eviction mechanism, it's very efficient for certain workloads.
You can also design an API optimized for batch workloads (say the same core prompt with different data for instruct-style reasoning) - that can result in large savings in those scenarios.
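The LRU idea above can be sketched in a few lines. This is a toy illustration only: `PrefixKVCache` and `kv_state` are made-up names, and in a real system the cached values would be KV tensors resident in GPU memory, keyed by a prompt-prefix hash rather than the raw string.

```python
from collections import OrderedDict


class PrefixKVCache:
    """Toy LRU cache mapping a prompt prefix to its cached KV state."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache = OrderedDict()  # insertion order doubles as recency order

    def get(self, prefix: str):
        if prefix not in self._cache:
            return None
        self._cache.move_to_end(prefix)  # mark as most recently used
        return self._cache[prefix]

    def put(self, prefix: str, kv_state) -> None:
        if prefix in self._cache:
            self._cache.move_to_end(prefix)
        self._cache[prefix] = kv_state
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
```

With a shared "core prompt" across a batch, one `put` of that prefix lets every subsequent request skip recomputing it, which is where the large savings come from.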
If you can pipeline upcoming requests and tie state to a specific request, doesn't that allow you to change how you design physical memory? (at least for inference)
Stupid question, but why wouldn't {extremely large slow-write, fast-read memory} + {smaller, very fast-write memory} be a feasible hardware architecture?
If you know many, many cycles ahead what you'll need to have loaded at a specific time.
Or hell, maybe it's time to go back to memory bank switching.
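To make the "know many cycles ahead" point concrete, here's a toy simulation of that two-tier layout: a small fast-write staging buffer fed by prefetches issued a few steps before each layer's weights are needed. Everything here (`PREFETCH_DISTANCE`, the buffer size, treating layers as integers) is an assumption for illustration, not a real memory controller.

```python
from collections import deque

PREFETCH_DISTANCE = 3  # assumed lookahead: how many steps early we issue loads


def simulate(schedule, buffer_size=4):
    """Count how often the needed weights are already staged in the small
    fast memory by the time the compute step arrives."""
    fast = deque(maxlen=buffer_size)  # small fast-write staging buffer
    # Pre-warm: issue the first few loads before compute starts.
    for layer in schedule[:PREFETCH_DISTANCE]:
        fast.append(layer)
    hits = 0
    for step, layer in enumerate(schedule):
        if layer in fast:
            hits += 1  # slow write already completed, hidden by lookahead
        nxt = step + PREFETCH_DISTANCE
        if nxt < len(schedule):
            fast.append(schedule[nxt])  # start the next slow write now
    return hits


# With a fully known schedule, every access hits the fast buffer:
print(simulate(list(range(10))))
```

The point of the sketch: if the access schedule is fully known ahead of time (as it is for a fixed inference graph), the slow-write latency of the big memory never sits on the critical path.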