(Author here) See https://github.com/clamchowder/Microbenchmarks/tree/master/Gpu...

jra101 · on Oct 20, 2022

Have you tried reducing the register count in your FP32 FMA test by increasing the iteration count and reducing the number of values computed per loop?

Instead of computing 8 independent values, compute one with 8x more iterations:

    // One dependent chain; count * 8 keeps the total number of FMAs the same.
    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;
    }

That, plus inlining the iteration count (making it a compile-time constant) so the compiler can unroll the loop, might help get closer to SOL (speed of light, i.e. theoretical peak throughput).
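
As a rough illustration of the suggestion above, here is a minimal plain C sketch, not the benchmark's actual OpenCL kernel. The function names, the starting value of 1.0f, and the ITERATIONS constant are hypothetical and chosen only for this example. It contrasts eight independent values per iteration with a single value and 8x the iterations, plus a variant whose trip count is a compile-time constant.

```c
/* Hypothetical compile-time trip count, so the compiler sees a constant bound. */
#define ITERATIONS 16

/* Shape being replaced: 8 independent values updated per loop iteration. */
float fma_eight_values(float acc, int count) {
    float v0 = 1.0f, v1 = 1.0f, v2 = 1.0f, v3 = 1.0f;
    float v4 = 1.0f, v5 = 1.0f, v6 = 1.0f, v7 = 1.0f;
    for (int i = 0; i < count; i++) {
        v0 += acc * v0;
        v1 += acc * v1;
        v2 += acc * v2;
        v3 += acc * v3;
        v4 += acc * v4;
        v5 += acc * v5;
        v6 += acc * v6;
        v7 += acc * v7;
    }
    /* Combine the values so none of the work can be discarded as unused. */
    return v0 + v1 + v2 + v3 + v4 + v5 + v6 + v7;
}

/* Suggested shape: one value, 8x the iterations, same total FMA count. */
float fma_one_value(float acc, int count) {
    float v0 = 1.0f;
    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;
    }
    return v0;
}

/* Same loop with the trip count fixed at compile time (assumed value above). */
float fma_one_value_const(float acc) {
    float v0 = 1.0f;
    for (int i = 0; i < ITERATIONS; i++) {
        v0 += acc * v0;
    }
    return v0;
}
```

Both loop shapes issue the same number of FMAs. The single-value form keeps fewer values live, but it is one dependent chain per thread, so on a GPU it presumably relies on having enough threads in flight to hide FMA latency.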

clamchowder · on Oct 20, 2022

The problem is that loop overhead matters on AMD, because AMD's compiler doesn't unroll the loop. Nvidia's does, so the overhead doesn't matter there.

WithinReason · on Oct 21, 2022

Could you unroll it with #pragma unroll?
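
For reference, a minimal sketch of what an unroll hint could look like, written as plain C rather than the benchmark's OpenCL kernel. The UNROLL_HINT_8 macro and both function names are hypothetical. In a kernel the hint would typically be written directly as #pragma unroll above the loop; here the macro maps to the spelling the host compiler accepts (Clang's #pragma unroll, GCC's #pragma GCC unroll) and expands to nothing elsewhere. A pragma is only a hint, and nothing in this thread confirms that AMD's compiler honors it, so a manually unrolled variant is shown as a fallback.

```c
/* Pick an unroll hint the host compiler understands; otherwise expand to nothing. */
#if defined(__clang__)
#define UNROLL_HINT_8 _Pragma("unroll 8")
#elif defined(__GNUC__) && __GNUC__ >= 8
#define UNROLL_HINT_8 _Pragma("GCC unroll 8")
#else
#define UNROLL_HINT_8
#endif

/* Single-value loop with an unroll hint; the compiler is free to ignore it. */
float fma_unroll_hint(float acc, int count) {
    float v0 = 1.0f;
    UNROLL_HINT_8
    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;
    }
    return v0;
}

/* Fallback: unroll by hand, so loop overhead is paid once per 8 FMAs. */
float fma_unroll_manual(float acc, int count) {
    float v0 = 1.0f;
    for (int i = 0; i < count; i++) {
        v0 += acc * v0;
        v0 += acc * v0;
        v0 += acc * v0;
        v0 += acc * v0;
        v0 += acc * v0;
        v0 += acc * v0;
        v0 += acc * v0;
        v0 += acc * v0;
    }
    return v0;
}
```

Both functions perform count * 8 FMAs on the same value, so they should return identical results; only the amount of loop overhead per FMA differs.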