
Well, I’m afraid for your case there’s no silver bullet, neither hardware nor software. If you really need to improve things, you should refactor the code, and especially the data structures, for performance. A couple of tips.

I recommend staying away from GPGPU, at least for now. Porting things from CPU to GPU, especially in a cross-platform code base, is relatively hard on its own. Unless you’re fine with vendor lock-in to Nvidia: CUDA is easier to integrate than the alternatives.

Viewing SIMD lanes as equivalents of GPU threads is only one possible approach. It’s also possible to leverage these fixed-length vectors as they are, within a single logical thread. A trivial example: all modern implementations of memcpy() use SSE2 or AVX instructions to move bytes sequentially, without any thread-level parallelism. You obviously don’t need to implement memcpy yourself because it’s already in the standard library, but the SSE2 and AVX2 sets have hundreds of instructions for manipulating integer lanes in these vectors.


