gongy's comments | Hacker News

The improvement is real!

And unlike a lot of research code, this actually runs well. I was able to reproduce the results on Modal GPUs; the code is here: https://github.com/mirage-project/mirage/pull/327/files

Triton + FlashInfer: prompt length 39, generate length 264, per-token latency 19.19 ms

MPK: prompt length 39, generate length 334, per-token latency 7.72 ms
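
For anyone curious what "using Modal GPUs" means mechanically: you wrap the benchmark in a function decorated with a GPU type and Modal runs it remotely. Here's a minimal sketch, assuming a recent Modal client; the app name, image contents, and dummy workload are mine, not the PR's (the real Mirage/FlashInfer setup is in the linked PR):

    # Sketch of running a GPU benchmark on Modal. The workload below is a
    # placeholder matmul loop, not the actual MPK decode loop.
    import modal

    app = modal.App("mpk-repro")
    image = modal.Image.debian_slim(python_version="3.11").pip_install("torch")

    @app.function(gpu="A100", image=image, timeout=600)
    def benchmark():
        import time
        import torch

        x = torch.randn(4096, 4096, device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(100):
            x = x @ x.T
            x = x / x.norm()  # keep values bounded across iterations
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        print(f"per-iteration latency: {elapsed / 100 * 1000:.3f} ms")

    @app.local_entrypoint()
    def main():
        benchmark.remote()

`modal run bench.py` then executes benchmark() on a cloud GPU and streams the printed latency back to your terminal.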


Thanks for reproducing our results!


You can run load-balanced inference and also fine-tune models on Modal (disclaimer: I wrote this guide).

You can get a load-balanced endpoint serving hundreds of tokens/s per GPU using vLLM or Text Generation Inference in a few lines of code:

https://modal.com/docs/guide/ex/text_generation_inference
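
To give a sense of what those "few lines" look like, here's a minimal sketch assuming a recent Modal client and vLLM's OpenAI-compatible server; the app name, model, and GPU type are placeholders, and the linked guide has the maintained version:

    # Sketch of a vLLM endpoint on Modal. Modal proxies port 8000 and
    # autoscales containers with traffic, which is where the load
    # balancing comes from.
    import subprocess
    import modal

    app = modal.App("vllm-endpoint")
    image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")

    @app.function(image=image, gpu="A100", timeout=60 * 20)
    @modal.web_server(port=8000, startup_timeout=60 * 10)
    def serve():
        # Launch vLLM's OpenAI-compatible HTTP server inside the container.
        subprocess.Popen(
            "python -m vllm.entrypoints.openai.api_server"
            " --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000",
            shell=True,
        )

`modal deploy` gives you a public URL for it, and any OpenAI-compatible client can point at that URL.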

