
Because then, instead of RAM bandwidth, you're dealing with PCIe bandwidth, which is far lower.


For LLM inference at batch size 1, it's hard to saturate PCIe bandwidth, especially with less powerful chips, so you get close to linear scaling[1]. The obvious issue is that running on multiple GPUs is harder, and a lot of software either doesn't fully support it or isn't optimized for it.

[1]: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
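A back-of-the-envelope sketch of why batch-1, layer-split decoding barely touches PCIe: per generated token, only the activation vector at the split point crosses the bus. All numbers below (hidden size, bus speed, token rate) are assumed for illustration, not taken from the benchmark link.

    # Rough estimate of PCIe traffic for batch-1 pipeline-split inference.
    hidden_size = 8192          # assumed model dimension (70B-class model)
    bytes_per_act = 2           # fp16 activations
    pcie_bytes_per_s = 32e9     # assumed PCIe 4.0 x16 peak, ~32 GB/s
    tokens_per_s = 30           # assumed decode speed

    # Only the activation vector at the layer split crosses PCIe per token.
    bytes_per_token = hidden_size * bytes_per_act          # ~16 KiB
    transfer_time_s = bytes_per_token / pcie_bytes_per_s   # ~0.5 microseconds
    pcie_utilization = bytes_per_token * tokens_per_s / pcie_bytes_per_s

    print(f"{bytes_per_token / 1024:.1f} KiB/token, "
          f"{transfer_time_s * 1e6:.2f} us/token, "
          f"PCIe utilization ~{pcie_utilization * 100:.4f}%")

With these assumed numbers the bus sits essentially idle during decode, which is why the scaling stays close to linear until you raise the batch size or use tensor parallelism with heavier all-reduce traffic.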


It's also less power efficient, takes up more PCIe slots, and a lot of software doesn't support GPU clustering. I already have 4x 16GB GPUs and they can't run large models that exceed 16GB.

I'm currently running them in separate VMs to make full use of them. I used to run them in separate Docker containers, but OOM exceptions would frequently bring down the whole server; moving to VMs resolved that.
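For comparison, a minimal sketch of process-level isolation (one worker per GPU pinned via CUDA_VISIBLE_DEVICES, restarted by a supervisor) as an alternative to full VMs; worker.py and the GPU count are hypothetical placeholders, not anything from the comment above.

    import os
    import subprocess
    import time

    NUM_GPUS = 4                            # assumed: 4x 16GB cards
    WORKER_CMD = ["python", "worker.py"]    # hypothetical inference worker

    def launch(gpu_id: int) -> subprocess.Popen:
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)   # worker sees only this GPU
        return subprocess.Popen(WORKER_CMD, env=env)

    procs = {gpu: launch(gpu) for gpu in range(NUM_GPUS)}

    while True:
        for gpu, proc in procs.items():
            if proc.poll() is not None:     # worker died, e.g. CUDA OOM
                print(f"worker on GPU {gpu} exited ({proc.returncode}); restarting")
                procs[gpu] = launch(gpu)
        time.sleep(5)

This only contains a crash to one process rather than the whole host, so it doesn't give the same blast-radius guarantees as a VM, but it avoids the virtualization overhead.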


What’s your application for high VRAM that doesn’t leverage multiple GPUs?



