StrangeDoctor on Aug 27, 2024, on: "Tinyboxes finally have a buy it now button"

I don't see why 6 is inherently worse than 4 or 8; not all of the layers are exactly equal or a power of 2 in count. Compared with 2^2 or 2^3, 2^1*3^1 might give you more options. The main issue I run into is flops vs. RAM in any given card/model.

andersa on Aug 27, 2024

Usually you want to split each layer to run with tensor parallelism, which works optimally if you can assign each kv head to a specific GPU. All currently popular models have a power of 2 number of kv heads.
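To make the divisibility point concrete, here is a minimal sketch in Python. The helper name `kv_heads_per_gpu` and the figure of 8 kv heads are assumptions chosen for illustration, not taken from any particular model or framework.

```python
def kv_heads_per_gpu(num_kv_heads: int, num_gpus: int):
    """Return the kv heads each GPU would own, or None if they don't split evenly."""
    if num_kv_heads % num_gpus == 0:
        return num_kv_heads // num_gpus
    return None  # heads cannot be assigned one-to-one without leftovers


# Hypothetical model with 8 kv heads, tried on 4, 6, and 8 GPUs.
for gpus in (4, 6, 8):
    print(gpus, "GPUs ->", kv_heads_per_gpu(8, gpus))
```

With a power-of-2 head count, 4 and 8 GPUs divide cleanly while 6 does not, which is the case where the even per-GPU assignment breaks down.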

StrangeDoctor on Aug 27, 2024

Interesting, thank you for the pointers.
