And unlike a lot of research code, this actually runs well. I reproduced the results on Modal GPUs and left the code here: https://github.com/mirage-project/mirage/pull/327/files
Triton + FlashInfer: prompt length 39, generate length 264, per-token latency 19.19 ms
MPK: prompt length 39, generate length 334, per-token latency 7.72 ms
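For anyone curious what the Modal side looks like, this is roughly the scaffold I mean (a minimal sketch, not the exact repro in the PR above; the image setup, GPU type, and the demo script path are placeholders I'm assuming, the real steps are in the linked PR):

    # Sketch: run a benchmark script on a Modal GPU.
    # The build steps and demo script path are illustrative placeholders;
    # see the linked PR for the actual reproduction code.
    import modal

    image = (
        modal.Image.debian_slim()
        .apt_install("git")
        .pip_install("torch")  # plus whatever mirage / FlashInfer need
        .run_commands("git clone https://github.com/mirage-project/mirage.git /mirage")
    )

    app = modal.App("mpk-repro", image=image)

    @app.function(gpu="A100", timeout=3600)
    def benchmark():
        import subprocess
        # Placeholder entry point; the real demo script lives in the repo/PR.
        subprocess.run(["python", "/mirage/demo/demo.py"], check=True)

    @app.local_entrypoint()
    def main():
        benchmark.remote()

Then it's just `modal run benchmark.py` and Modal spins up the GPU container, runs the script, and tears it down.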
You can get a load-balanced endpoint serving hundreds of tok/s per GPU using vLLM or Text Generation Inference in a few lines of code:
https://modal.com/docs/guide/ex/text_generation_inference
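The basic shape is something like this (a rough sketch, not the exact example from the docs link; the model name and vLLM usage are illustrative assumptions):

    # Sketch: an auto-scaled Modal class serving an LLM with vLLM.
    # Model name and sampling parameters are illustrative; the linked
    # docs page walks through the full, production-ready example.
    import modal

    image = modal.Image.debian_slim().pip_install("vllm")
    app = modal.App("llm-endpoint", image=image)

    @app.cls(gpu="A100")
    class Model:
        @modal.enter()
        def load(self):
            from vllm import LLM
            # Load the model once per container, not once per request.
            self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

        @modal.method()
        def generate(self, prompt: str) -> str:
            from vllm import SamplingParams
            out = self.llm.generate([prompt], SamplingParams(max_tokens=256))
            return out[0].outputs[0].text

Modal handles the load balancing part for you: requests fan out across however many GPU containers are warm, and it scales them up and down with traffic. Wrapping the method in a web endpoint is a few more lines on top of this.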