And unlike a lot of research code, this actually runs well. I reproduced the results on Modal GPUs and left the code here: https://github.com/mirage-project/mirage/pull/327/files
Triton + FlashInfer: prompt length 39, generate length 264, per-token latency 19.19 ms
MPK: prompt length 39, generate length 334, per-token latency 7.72 ms
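For anyone curious what the Modal side looks like, this is roughly the scaffold I mean (a minimal sketch, not the exact repro in the PR above; the image setup, GPU type, and the demo script path are placeholders I'm assuming, the real steps are in the linked PR):

    # Sketch: run a benchmark script on a Modal GPU.
    # The build steps and demo script path are illustrative placeholders;
    # see the linked PR for the actual reproduction code.
    import modal

    image = (
        modal.Image.debian_slim()
        .apt_install("git")
        .pip_install("torch")  # plus whatever mirage / FlashInfer need
        .run_commands("git clone https://github.com/mirage-project/mirage.git /mirage")
    )

    app = modal.App("mpk-repro", image=image)

    @app.function(gpu="A100", timeout=3600)
    def benchmark():
        import subprocess
        # Placeholder entry point; the real demo script lives in the repo/PR.
        subprocess.run(["python", "/mirage/demo/demo.py"], check=True)

    @app.local_entrypoint()
    def main():
        benchmark.remote()

Then it's just `modal run benchmark.py` and Modal spins up the GPU container, runs the script, and tears it down.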
You can get a load-balanced endpoint serving hundreds of tok/s per GPU using vLLM or Text Generation Inference in a few lines of code:
https://modal.com/docs/guide/ex/text_generation_inference
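The basic shape is something like this (a rough sketch, not the exact example from the docs link; the model name and vLLM usage are illustrative assumptions):

    # Sketch: an auto-scaled Modal class serving an LLM with vLLM.
    # Model name and sampling parameters are illustrative; the linked
    # docs page walks through the full, production-ready example.
    import modal

    image = modal.Image.debian_slim().pip_install("vllm")
    app = modal.App("llm-endpoint", image=image)

    @app.cls(gpu="A100")
    class Model:
        @modal.enter()
        def load(self):
            from vllm import LLM
            # Load the model once per container, not once per request.
            self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

        @modal.method()
        def generate(self, prompt: str) -> str:
            from vllm import SamplingParams
            out = self.llm.generate([prompt], SamplingParams(max_tokens=256))
            return out[0].outputs[0].text

Modal handles the load balancing part for you: requests fan out across however many GPU containers are warm, and it scales them up and down with traffic. Wrapping the method in a web endpoint is a few more lines on top of this.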