It's very much a work in progress, as noted in the article. And some of the stuff that worked reasonably well on my cards, like the instruction rate test when trying to measure throughput across the entire card, went down the drain when run on Arc.
Have you tried reducing the register count in your FP32 FMA test by increasing the iteration count and reducing the number of values computed per loop?
Instead of computing 8 independent values, compute one with 8x more iterations:
for (int i = 0; i < count * 8; i++) {
    v0 += acc * v0;
}
That, plus hard-coding the iteration count as a compile-time constant so the compiler can unroll the loop, might help get closer to SOL (speed of light, i.e. theoretical peak throughput).
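To make the suggestion concrete, here is a rough sketch in plain C rather than the actual kernel language, so treat it as a stand-in for the kernel body and not the article's real test. The names `FMA_ITERATIONS`, `fma_single_chain` and `fma_eight_chains` are made up for illustration, and the trip count of 1024 is an arbitrary assumption. It shows the single-accumulator version next to an eight-accumulator version that does the same total number of multiply-adds.

```c
/* Hypothetical trip count, fixed at compile time so the compiler
 * knows the loop bounds and is free to unroll. */
#define FMA_ITERATIONS 1024

/* Suggested variant: one live accumulator, 8x the iterations.
 * Each step depends on the previous one (a single dependent chain). */
static float fma_single_chain(float v0, float acc)
{
    for (int i = 0; i < FMA_ITERATIONS * 8; i++) {
        v0 += acc * v0;
    }
    return v0;
}

/* Comparison variant: eight independent accumulators, so roughly
 * eight times as many values live in registers per thread. */
static float fma_eight_chains(float seed, float acc)
{
    float v[8];
    for (int k = 0; k < 8; k++) {
        v[k] = seed + (float)k;
    }
    for (int i = 0; i < FMA_ITERATIONS; i++) {
        for (int k = 0; k < 8; k++) {
            v[k] += acc * v[k];
        }
    }
    /* Sum the results so the compiler cannot discard the work. */
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) {
        sum += v[k];
    }
    return sum;
}
```

Both do the same number of multiply-adds per call. The single-chain version trades instruction-level parallelism for lower register pressure, so whether it gets closer to peak probably depends on having enough threads in flight to cover FMA latency.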