
I don't see why 6 is inherently worse than 4 or 8. Not all of the layers are exactly equal in size, and their count isn't always a power of 2. Compared with 2^2 or 2^3, 2^1*3^1 might give you more options.

The main issue I run into is FLOPs vs. RAM for any given card/model.



Usually you want to split each layer to run with tensor parallelism, which works best if you can assign each KV head to a specific GPU. All currently popular models have a power-of-2 number of KV heads.
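
To make the divisibility point concrete, here is a rough sketch. The function name and the figure of 8 KV heads are made up for illustration, not taken from any particular model or framework. It only checks whether the KV heads split evenly across the GPUs.

```python
from typing import Optional


def kv_heads_per_gpu(num_kv_heads: int, num_gpus: int) -> Optional[int]:
    """Return how many KV heads each GPU gets if they split evenly, else None.

    Hypothetical helper: real tensor-parallel frameworks have their own
    rules (some replicate KV heads instead of refusing uneven splits).
    """
    if num_kv_heads % num_gpus == 0:
        return num_kv_heads // num_gpus
    return None


# Assumed example: a model with 8 KV heads on 4, 6, and 8 GPUs.
for gpus in (4, 6, 8):
    print(gpus, kv_heads_per_gpu(8, gpus))
```

With 8 KV heads, 4 and 8 GPUs divide cleanly (2 and 1 heads each), while 6 does not, so under this simple even-split assumption a head would have to be shared or replicated across GPUs.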


interesting, thank you for the pointers.



