If you mean “future branches on other threads”, I recommend re-thinking your algorithm and/or data structures.
If the hardware has a strong memory order like AMD64, on a massively multicore CPU your code will be very slow because of cache coherency protocols. Reading from a cache line recently modified by another core is typically even more expensive than an L3 cache miss and a roundtrip to system RAM.
Or if the hardware has a weak memory order (like GPUs, unless you use interlocked instructions, which are handled by a dedicated piece of hardware shared across the complete chip), recent changes to your state will be ignored by other cores: they may load stale data from their local caches. The performance is better, but the results are incorrect.
Luckily, the future branches belong to the same "thread": the entire bookkeeping data structure is local to one thread of execution of the algorithm. But access and modification of the bookkeeping structure is essentially random access, for any notion of "nearby" I can think of, and a future branch decision can depend on pretty arbitrary parts of that data structure. See my other long reply for more context. Synchronisation between threads can happen at pretty arbitrary times and involves exchanging new rows of the bookkeeping structure.
I’ve read that comment but I still don’t understand what it is you’re computing.
It seems you have a substantial amount of complicated C++ written without much thought about performance, and now you want to improve performance without spending too much time reworking things.
If that’s the case, I’m not sure this is going to work. You’ll spend hours with a profiler trying micro-optimizations, but if the data structures involved aren’t good (for instance, too many pointers between things), these micro-optimizations won’t necessarily help much.
Also, I believe no modern hardware has general ways to efficiently synchronize fine-grained stuff across cores. Shared memory works, but the latency is not great for fine-grained tasks measured in microseconds or less.
I appreciate you challenging my ideas as far as your knowledge took you. I appreciate that you stopped when my explanation was insufficient. Your diagnosis:
> It seems you have a substantial amount of complicated C++ written without much thought about performance, and now you want to improve performance without spending too much time reworking things.
is correct, except that it was someone else who wrote the "substantial amount of complicated C++ written without much thought about performance", and now I "want to improve performance".
Well, I’m afraid for your case there’s no silver bullet, neither in hardware nor in software. If you really need to improve things, you should refactor the code, and especially the data structures, for performance. A couple of tips:
I recommend staying away from GPGPU, at least for now. Porting things from CPU to GPU, especially in a cross-platform code base, is relatively hard on its own. Unless you’re fine with vendor lock-in to nVidia: CUDA is easier to integrate than the alternatives.
Viewing SIMD lanes as equivalents of GPU threads is only one possible approach. It’s also possible to leverage these fixed-length vectors as they are, within a single logical thread. A trivial example: all modern implementations of memcpy() use SSE2 or AVX instructions to move bytes sequentially, without any parallelism. You obviously don’t need to implement memcpy, because it’s already in the standard library, but the SSE2 and AVX2 instruction sets have hundreds of instructions for manipulating integer lanes in these vectors.