First, what a great resource you've put together! You're presenting a lot of useful material clearly and concisely, together with the reasoning behind it. I wish I had this when I started doing performance work.
> For a perfect drop-in replacement of std::lower_bound, the best you can do without breaking anything is to make the search branchless and maybe add some prefetching [1]. Some compilers actually try to implement std::lower_bound this way, but these things can sometimes break from version to version because there are no semantics to reliably make compiler use predication instead of branching.
What's imperfect about a radix-4 (or higher) search instead of a binary search for replacing std::lower_bound? For a radix-k search, you'll have k-1 fetches in flight before you decide which subarray to recurse on, and your dep chain will be log_2(k) times shorter.
When you add prefetching (that is, compare against the middle element and fetch both the middle of the left half and the middle of the right half ahead of time) you are essentially doing radix-4 search, just with fewer actual comparisons.
(You can prefetch k layers ahead for radix-2^k search, but searching in small arrays will become more expensive this way.)
I didn't benchmark it, but I guess on mid-to-large arrays it would actually work somewhat slower than prefetching because it is more instruction-heavy, and, more importantly, prefetches are "cancelable": if the memory bus is too busy with actual requests, they will be skipped, while in explicit radix-k search you would have to wait for all (k-1) elements even if the middle element happened to be already cached and you already know which elements you need next.
That said, it could probably work with small arrays where caching is not a concern and especially if you optimize for latency and not throughput. You can also try to do the comparisons with SIMD (Wojciech Mula tried something similar and got a small boost: http://0x80.pl/articles/simd-search.html).
> For a perfect drop-in replacement of std::lower_bound, the best you can do without breaking anything is to make the search branchless and maybe add some prefetching [1]. Some compilers actually try to implement std::lower_bound this way, but these things can sometimes break from version to version because there are no semantics to reliably make compiler use predication instead of branching.
What's imperfect about a radix-4 (or higher) search instead of a binary search for replacing std::lower_bound? For a radix-k search, you'll have k-1 fetches in flight before you decide which subarray to recurse on, and your dep chain will be log_2(k) times shorter.