FlashHead: Up to 40% Faster Multimodal Reasoning on Top of Quantization (huggingface.co)
1 point by Embedl-Wilhelm 10 days ago | 1 comment



We released a Cosmos-Reason2-2B W4A16 + FlashHead build optimized for Jetson devices. FlashHead is a drop-in replacement for the LM head that increases token-generation throughput without sacrificing reasoning quality, and it composes with techniques like quantization.
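For context on why the LM head matters: it is the final projection from the hidden state to vocabulary logits, and in small models with large vocabularies it accounts for a sizable share of per-token compute. A back-of-the-envelope sketch (the hidden size and vocabulary size below are illustrative assumptions, not Cosmos-Reason2 specifics):

```python
# Illustrative only: how large the LM head can be relative to a small model.
hidden_size = 2048        # assumed hidden dimension for a ~2B-parameter model
vocab_size = 151_936      # assumed vocabulary size (Qwen-style tokenizer)
total_params = 2e9        # nominal "2B" parameter count

# The LM head is a single hidden_size x vocab_size projection.
lm_head_params = hidden_size * vocab_size
share = lm_head_params / total_params

print(f"LM head parameters: {lm_head_params / 1e6:.0f}M "
      f"({share:.0%} of a 2B model)")
```

Under these assumptions the LM head alone is on the order of 300M parameters, which is why replacing it can move end-to-end token throughput noticeably.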

Try it with vllm-serve:

ssh <your-orin>

docker run --rm -it \
  --network host \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN_HERE> \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --gpu-memory-utilization 0.75 \
    --trust-remote-code

curl localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead","messages":[{"role":"user","content":"Hi"}]}'

Jetson video inference benchmark (tokens per second, batch size = 1, 12 frames, 1280×720):

Device      FP16   W4A16   W4A16+FlashHead
Orin Nano   OOM    43.7    53.5
AGX Orin    39.6   74.4    92.2
AGX Thor    56.2   88.3    128.2

Model: https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2-...
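Since vLLM serves an OpenAI-compatible endpoint, you can also query it from Python instead of curl. A minimal stdlib-only sketch (the host/port and endpoint path mirror the curl example above; `chat` is a hypothetical helper name):

```python
import json
import urllib.request

MODEL_ID = "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead"

def build_payload(prompt: str) -> dict:
    """Assemble a single-turn chat request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send one chat turn to the OpenAI-compatible /chat/completions route."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Hi"))
```

The official `openai` Python client works the same way if you point its `base_url` at the server.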

We’re Embedl, a research startup from Gothenburg, Sweden, and the team behind FlashHead. Let us know which other models you’d like to see it applied to.



