I don't see why 6 is inherently worse than 4 or 8. Not all of the layers are exactly equal or a power of 2 in count. Compared with 2^2 or 2^3, 2^1*3^1 might even give you more options.
The main issue I run into is FLOPS vs. RAM for any given card/model.
Usually you want to split each layer with tensor parallelism, which works best when each KV head can be assigned to a specific GPU. Most currently popular models have a power-of-2 number of KV heads, and a power of 2 never divides evenly by 6.
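To make the divisibility point concrete, here's a rough sketch. The helper name `kv_heads_per_gpu` and the count of 8 KV heads are just assumptions for illustration, not taken from any particular model or framework, and it ignores tricks like replicating KV heads across GPUs.

```python
from typing import Optional


def kv_heads_per_gpu(num_kv_heads: int, num_gpus: int) -> Optional[int]:
    """Return KV heads per GPU if they split evenly, otherwise None.

    Hypothetical helper: only checks plain divisibility and does not
    model KV-head replication or uneven sharding.
    """
    if num_kv_heads <= 0 or num_gpus <= 0:
        raise ValueError("counts must be positive")
    if num_kv_heads % num_gpus == 0:
        return num_kv_heads // num_gpus
    return None


# Assumed example value: 8 KV heads (a power of 2).
NUM_KV_HEADS = 8

for gpus in (2, 4, 6, 8):
    split = kv_heads_per_gpu(NUM_KV_HEADS, gpus)
    if split is None:
        print(f"{gpus} GPUs: no even split of {NUM_KV_HEADS} KV heads")
    else:
        print(f"{gpus} GPUs: {split} KV head(s) per GPU")
```

With a power-of-2 head count, only the 6-GPU case fails the check, because of the factor of 3.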