You have five bodies and 16 SSE registers. The entire state of your simulation can fit into register space (5 bodies x 6 doubles = 30 values, and 16 registers x 2 doubles = 32 slots), so you never need to access memory during the stepping part of your code. You can unroll all the gravity interactions so that you end up with one large branchless, memory-free block of code. Now that it's completely inline, you can rearrange your dependencies based on the expected latency and throughput of each operation (https://software.intel.com/sites/landingpage/IntrinsicsGuide...)
After that you can merge operations where possible. (With SSE4 the most you are going to get is 2x, because you are using doubles and a 128-bit register holds only two of them.)
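To illustrate what "merging" could look like, here is a rough sketch that handles two body pairs per instruction by putting one pair in each lane of a 128-bit register. The function name pair_magnitudes and its argument layout are made up for this example, and it assumes an x86 target with SSE2 available; it is not taken from your code.

```cpp
#include <emmintrin.h>  // SSE2: __m128d and the packed-double intrinsics

// Hypothetical helper: for two body pairs A and B, given their coordinate
// differences (dx, dy, dz), compute dt / |d|^3 for both pairs at once.
// Low lane = pair A, high lane = pair B.
inline void pair_magnitudes(const double da[3], const double db[3],
                            double dt, double out[2]) {
    // _mm_set_pd takes the high lane first, then the low lane.
    const __m128d dx = _mm_set_pd(db[0], da[0]);
    const __m128d dy = _mm_set_pd(db[1], da[1]);
    const __m128d dz = _mm_set_pd(db[2], da[2]);

    // d2 = dx*dx + dy*dy + dz*dz, two pairs per instruction.
    __m128d d2 = _mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy));
    d2 = _mm_add_pd(d2, _mm_mul_pd(dz, dz));

    // mag = dt / (d2 * sqrt(d2)), still two pairs per instruction.
    const __m128d d3 = _mm_mul_pd(d2, _mm_sqrt_pd(d2));
    const __m128d mag = _mm_div_pd(_mm_set1_pd(dt), d3);

    _mm_storeu_pd(out, mag);  // out[0] = pair A, out[1] = pair B
}
```

Every arithmetic instruction here does the work of two scalar ones, which is where the 2x ceiling comes from. In a real register-only version you would keep the values in __m128d variables instead of loading from and storing to arrays.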
You may think the full inlining is cheating, but the compiler has the same information, since your bodies list is entirely constant. (Since your dt and your masses are constant, they can potentially be folded as well.)
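As a rough sketch of the unrolling idea (not your actual code), the pair loop can be expanded at compile time so that every body index is a constant. The names State, kN, kDt and kMass and the mass values are placeholders I picked for illustration, and this assumes C++17 for the fold expressions.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <utility>

// Placeholder constants: body count, time step and masses are all known
// at compile time, so products like kMass[J] * kDt can be folded.
constexpr std::size_t kN = 5;
constexpr double kDt = 0.01;
constexpr std::array<double, kN> kMass = {1.0, 1e-3, 3e-4, 4e-5, 5e-5};

struct State {
    std::array<double, kN> x, y, z, vx, vy, vz;
};

// One pair interaction with compile-time indices I and J: no loop, no branch.
template <std::size_t I, std::size_t J>
inline void interact(State& s) {
    const double dx = s.x[I] - s.x[J];
    const double dy = s.y[I] - s.y[J];
    const double dz = s.z[I] - s.z[J];
    const double d2 = dx * dx + dy * dy + dz * dz;
    const double mag = kDt / (d2 * std::sqrt(d2));
    s.vx[I] -= dx * (kMass[J] * mag);
    s.vy[I] -= dy * (kMass[J] * mag);
    s.vz[I] -= dz * (kMass[J] * mag);
    s.vx[J] += dx * (kMass[I] * mag);
    s.vy[J] += dy * (kMass[I] * mag);
    s.vz[J] += dz * (kMass[I] * mag);
}

// Expand every J > I for a fixed I.
template <std::size_t I, std::size_t... Js>
inline void interact_row(State& s, std::index_sequence<Js...>) {
    (interact<I, I + 1 + Js>(s), ...);
}

// Expand every I, giving all 10 pairs for 5 bodies as straight-line code.
template <std::size_t... Is>
inline void interact_all(State& s, std::index_sequence<Is...>) {
    (interact_row<Is>(s, std::make_index_sequence<kN - 1 - Is>{}), ...);
}

inline void step(State& s) {
    interact_all(s, std::make_index_sequence<kN>{});
    // Fixed trip count, so the compiler can unroll this as well.
    for (std::size_t i = 0; i < kN; ++i) {
        s.x[i] += kDt * s.vx[i];
        s.y[i] += kDt * s.vy[i];
        s.z[i] += kDt * s.vz[i];
    }
}
```

Whether the optimizer actually keeps all of State in registers depends on the compiler and flags, so it is worth checking the generated assembly. The point is only that nothing here requires information the compiler does not already have.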