If you mean “future branches on other threads”, I recommend re-thinking your algorithm and/or data structures.
If the hardware has a strong memory order like AMD64, on a massively multicore CPU your code will be very slow because of cache coherency protocols. Reading from a cache line recently modified by another core is typically even more expensive than an L3 cache miss and a roundtrip to system RAM.
Or if the hardware has a weak memory order (like GPUs, unless you use interlocked instructions, which are handled by a dedicated piece of hardware shared across the complete chip), recent changes to your state will be ignored by other cores: they may load stale data from their local caches. The performance is better, but the results are incorrect.
Luckily, the future branches belong to the same "thread": the entire bookkeeping data structure is local to one thread of execution of the algorithm. But access and modification of the bookkeeping structure is essentially random access, for any notion of "nearby" I can think of, and a future branch decision can depend on pretty arbitrary parts of that data structure. See my other long reply for more context. Synchronisation between threads can happen at pretty arbitrary times and involves exchanging new rows of the bookkeeping structure.
I’ve read that comment but I still don’t understand what it is you’re computing.
It seems you have a substantial amount of complicated C++ written without much thought about performance, and now you want to improve performance without spending too much time reworking things.
If that’s the case, I’m not sure this is going to work. You’ll spend hours with a profiler trying micro-optimizations, but if the data structures involved aren’t good (for instance, too many pointers between things), these micro-optimizations won’t necessarily help much.
Also, I believe no modern hardware has general ways to efficiently synchronize fine-grained stuff across cores. Shared memory works, but the latency is not great for fine-grained tasks measured in microseconds or less.
I appreciate you challenging my ideas as far as your knowledge took you. I appreciate that you stopped when my explanation was insufficient. Your diagnosis:
> It seems you have a substantial amount of complicated C++ written without much thought about performance, and now you want to improve performance without spending too much time reworking things.
is correct, except that it was someone else who wrote the "substantial amount of complicated C++ written without much thought about performance", and now I "want to improve performance".
Well, I’m afraid for your case there’s no silver bullet, neither in hardware nor in software. If you really need to improve things, you should refactor the code, and especially the data structures, for performance. A couple of tips:
I recommend staying away from GPGPU, at least for now. Porting things from CPU to GPU, especially in a cross-platform code base, is relatively hard on its own. Unless you’re fine with vendor lock-in to nVidia: CUDA is easier to integrate than the alternatives.
Viewing SIMD lanes as equivalents of GPU threads is only one possible approach. It’s also possible to leverage these fixed-length vectors as they are, within a single logical thread. A trivial example: all modern implementations of memcpy() use SSE2 or AVX instructions to move bytes sequentially, without any parallelism. You obviously don’t need to implement memcpy, because it’s already in the standard library, but the SSE2 and AVX2 instruction sets have hundreds of instructions for manipulating integer lanes in these vectors.