> Doesn't the unrolled version skip the speculation for three out of every four entries?
It does, but what matters is breaking the data dependency often enough. By speculating only once every 4 items, we lower the average number of instructions per item (from 16 to 7.7), and thanks to value speculation the CPU can still start working on the next 4 items while it processes the current loop iteration.
So indeed, it's not a pure unrolling of the second function.
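To make that shape concrete, here is a minimal sketch (not the article's exact code) of a traversal that follows the real `next` pointers inside each 4-item block and only speculates on the address of the block that follows. The `Node` layout, the `node + 4` guess, and the function name are assumptions for illustration:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct Node {
    uint64_t value;
    struct Node *next;
} Node;

/* Sketch: speculate only once per block of 4 items.
 * Inside a block we follow the real pointers (a short dependent chain);
 * at the block boundary we guess that the next block starts right after
 * the current one in memory, which breaks the load-to-load dependency
 * between blocks whenever the guess is right. */
uint64_t sum_speculate_every_4(const Node *node) {
    uint64_t sum = 0;
    while (node) {
        const Node *real = node;
        for (int i = 0; i < 4 && real; i++) {   /* 4 items per block */
            sum += real->value;
            real = real->next;
        }
        const Node *guess = node + 4;  /* guess: the block was contiguous */
        node = guess;
        if (node != real)              /* wrong guess: take the real pointer */
            node = real;
    }
    return sum;
}
```

Note that an optimizing compiler is free to collapse the last three statements back into `node = real`, which defeats the speculation; keeping the guess alive in practice usually requires hiding it from the optimizer (e.g. with an empty `asm` barrier). The sketch only shows the control flow.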
> Will it end up worse than the "naive" version due to all the branch mispredictions?
If the data is allocated randomly, the next-cell prediction will often be wrong. The CPU's branch predictor then learns that the guess usually fails and stops continuing execution with the predicted value, so we are back to the behavior of the original implementation. In that case the overhead comes not from branch mispredictions but from the extra cycles spent computing the predicted value.
If the data is allocated in chunks, the optimization remains useful because the cost of rolling back the CPU pipeline at each chunk boundary is modest (I tested with the original list cut into 10 chunks: it added about 100ms, caused by roughly 100k additional branch misses).
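As an illustration of that experiment (the allocation scheme and names below are mine, not the author's benchmark), a chunked list can be built so that nodes within a chunk are contiguous and the "next node is adjacent" guess only fails when crossing from one chunk to the next:

```c
#include <stdlib.h>

/* Build `total` nodes split across `chunks` separately malloc'ed arrays.
 * Within a chunk the nodes are laid out back to back, so the address
 * guess holds; it only fails at chunk boundaries.
 * Reuses the Node type from the earlier sketch; assumes chunks >= 1
 * and total is a multiple of chunks. */
Node *make_chunked_list(size_t total, size_t chunks) {
    size_t per_chunk = total / chunks;
    Node *head = NULL, *tail = NULL;
    for (size_t c = 0; c < chunks; c++) {
        Node *block = malloc(per_chunk * sizeof(Node));
        if (!block) return head;               /* keep the sketch simple */
        for (size_t i = 0; i < per_chunk; i++) {
            block[i].value = 1;
            block[i].next = (i + 1 < per_chunk) ? &block[i + 1] : NULL;
        }
        if (tail) tail->next = &block[0];
        else      head = &block[0];
        tail = &block[per_chunk - 1];
    }
    return head;
}
```

The branch-miss counts for the contiguous and chunked layouts can then be compared with something like `perf stat -e branches,branch-misses ./bench`.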