
It depends on how large the input prompt (previous context) is. Also, if you can keep the cache on the GPU with an LRU eviction mechanism, it's very efficient for certain workloads.
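The LRU idea above can be sketched in a few lines. This is a toy illustration, not any real inference framework's API: the class name, and the use of strings as stand-ins for prompt prefixes and KV tensors, are assumptions for the example.

```python
from collections import OrderedDict

class KVPrefixCache:
    """Toy LRU cache mapping a prompt prefix to its KV-cache state.

    Sketch of the idea in the comment: keep recently used prefix
    caches resident (e.g. in GPU memory) and evict the least
    recently used entry once capacity is exceeded.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()  # prefix -> kv_state, oldest first

    def get(self, prefix):
        kv = self._entries.get(prefix)
        if kv is not None:
            self._entries.move_to_end(prefix)  # mark as most recently used
        return kv

    def put(self, prefix, kv_state):
        self._entries[prefix] = kv_state
        self._entries.move_to_end(prefix)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
```

A cache hit means the prefill for that prefix is skipped entirely, which is where the efficiency comes from when many requests share context.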

You can also design an API optimized for batch workloads (say, the same core prompt with different data for instruct-style reasoning); that can result in large savings in those scenarios.
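A rough way to see the savings: group requests by their shared core prompt so the prefix is prefilled once per group rather than once per request. The sketch below counts "tokens prefilled" with and without sharing; token counts and function names are illustrative assumptions.

```python
from collections import defaultdict

def batch_by_core_prompt(requests):
    """Group (core_prompt, data) pairs so each distinct core prompt's
    prefill (and KV cache) can be shared across its group. Illustrative only."""
    groups = defaultdict(list)
    for core_prompt, data in requests:
        groups[core_prompt].append(data)
    return groups

def prefill_cost(requests, share_prefix):
    """Total tokens prefilled, using len() as a stand-in for token count."""
    total = 0
    for core, datas in batch_by_core_prompt(requests).items():
        per_request_data = sum(len(d) for d in datas)
        if share_prefix:
            total += len(core) + per_request_data      # core prefilled once
        else:
            total += len(core) * len(datas) + per_request_data
    return total
```

With a long core prompt and many small data payloads, the shared-prefix cost approaches one prefill of the core prompt instead of N.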



If you can pipeline upcoming requests and tie state to a specific request, doesn't that let you change how you design physical memory (at least for inference)?

Stupid question, but why wouldn't {extremely large slow-write, fast-read memory} + {smaller, very fast-write memory} be a feasible hardware architecture?

It could work if you know many, many cycles ahead what you'll need to have loaded at a specific time.

Or hell, maybe it's time to go back to memory bank switching.




