Not very well, unfortunately. The author's confusion about how binary search works is preventing progress toward this solution. The idea is to do an "unbalanced" binary search, with the pivot chosen at offset (range size * big-endian value of the hash / (maximum possible big-endian hash value + 1)) into the current range. Correctness is exactly the same as for ordinary binary search, because binary search does not depend on where in the range the pivot falls (so long as it's not at offset zero or past the end). Below some heuristic range length, fall back to a linear search of the range. (Possibly, at some larger length, fall back to balanced binary search... but with an efficient linear search of no more than a couple of disk pages, I doubt this would help.)
This exactly. The author is not describing their approach very well, so I stopped reading because it was frustrating to try to figure out if they tried this solution.
The simple statement of the approach is:
Use binary search, but instead of using the middle as the pivot, use the estimated position heuristic as the pivot for each iteration.
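For concreteness, here is a minimal sketch of that loop in Python. It assumes a hypothetical binary file of raw 20-byte SHA-1 hashes sorted ascending with no counts attached (the real pwned-passwords files are a different, text-based format, so this only illustrates the pivot choice and the linear-scan fallback); the file layout, the 4 KiB cutoff, and the names are all made up for the example.

```python
import os

HASH_SIZE = 20                         # assumed record size: raw 20-byte SHA-1, no count field
PAGE_BYTES = 4096                      # assumed cutoff: linear-scan once the range fits in ~one page
LINEAR_THRESHOLD = PAGE_BYTES // HASH_SIZE
MAX_HASH = (1 << 160) - 1              # maximum possible big-endian value of a SHA-1 hash

def contains(f, target: bytes) -> bool:
    """Unbalanced binary search for `target` (20 raw bytes) in a sorted hash file."""
    n = os.fstat(f.fileno()).st_size // HASH_SIZE
    lo, hi = 0, n                      # half-open index range [lo, hi) still to search
    t = int.from_bytes(target, "big")

    while hi - lo > LINEAR_THRESHOLD:
        # Pivot at offset (range size * hash value / (max hash value + 1)) into the range,
        # instead of at the middle.  Since t <= MAX_HASH, the offset is always < hi - lo.
        pivot = lo + (hi - lo) * t // (MAX_HASH + 1)
        f.seek(pivot * HASH_SIZE)
        probe = f.read(HASH_SIZE)
        if probe == target:
            return True
        if probe < target:             # bytes compare lexicographically, i.e. as big-endian numbers
            lo = pivot + 1
        else:
            hi = pivot
    # Small range left: read it in one go and scan linearly.
    f.seek(lo * HASH_SIZE)
    block = f.read((hi - lo) * HASH_SIZE)
    return any(block[i:i + HASH_SIZE] == target
               for i in range(0, len(block), HASH_SIZE))
```

Usage would be something like `with open("hashes.bin", "rb") as f: contains(f, bytes.fromhex("..."))`, with `hashes.bin` being the hypothetical raw-hash file. Note that this sketch estimates the offset from the full hash value against the whole key space on every iteration, as described above; classic interpolation search instead interpolates between the keys at the ends of the current range, which tends to give better estimates after the first step but is the same idea.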
> You could compute the expected position of the hash in the file as hash / 2^160 * filesize and then just look around in the vicinity of that.
> The idea is to do an "unbalanced" binary search, with the pivot chosen as offset (range size * big-endian value of hash / (maximum possible big-endian value of hash + 1)).
Thanks. I was thinking along these lines but didn't know how to express or formalize it. Now I need to study unbalanced binary search algorithms.
I tried it out, and the maximum deviation of the actual position in the file from the estimated one is a bit under five megabytes for the version 8 file. That misses being able to fetch just a single page by quite a bit. So I guess building an index is the only real option to get the number of reads down, but even then you will have at least one read. With an SSD that read will take on the order of a few hundred microseconds, so it can be done in under a millisecond without relying on a cache. Binary search will require about 20 to 30 reads, so without a cache it would be about that much slower; however, the first couple of steps will quickly get cached, reducing the number of reads that actually hit the hardware quite a bit.
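For reference, a rough sketch of how one could measure that deviation, again under the assumption of the hypothetical raw 20-byte sorted binary layout from the sketch above (the real version 8 file is a different format, so the absolute numbers will differ):

```python
import os

HASH_SIZE = 20                # same assumed raw binary layout as in the earlier sketch
MAX_HASH = (1 << 160) - 1

def max_deviation(path: str) -> int:
    """Largest |estimated - actual| byte offset over all hashes in the file (slow, unoptimized)."""
    n = os.path.getsize(path) // HASH_SIZE
    worst = 0
    with open(path, "rb") as f:
        for actual_index in range(n):
            value = int.from_bytes(f.read(HASH_SIZE), "big")
            estimated_index = value * n // (MAX_HASH + 1)   # hash / 2^160 * number of hashes
            worst = max(worst, abs(estimated_index - actual_index) * HASH_SIZE)
    return worst
```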
> Binary search will require about 20 to 30 reads,
This does not pass a sanity check. If one switches to a linear scan at 4 KiB = 2^12 bytes, then 30 binary search steps (counting for the balanced case, but approximately right for the unbalanced one too) would search a 4 TiB file. I don't know exactly how big the hash file is, but it is nowhere near that big.
The maximum number of steps is lg(file size)-12, maybe plus 2 or so to account for the unbalanced search. (NB the unbalanced search approximates the balanced only because the hashes are approximately uniform; the relation doesn't hold in general.)
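Spelled out as a quick check (the function name here is just for illustration):

```python
import math

def max_steps(file_size_bytes: int) -> float:
    # lg(file size) - 12: page-granular binary-search steps with a 4 KiB linear-scan cutoff
    return math.log2(file_size_bytes) - 12

print(max_steps(2**42))   # 30.0 -> 30 steps would only be needed for a 4 TiB file
```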
The version 8 file contains 847,223,402 hashes, which gives 29.7 bits; that is where I got the 30 reads from. I assumed the most naïve approach, always just seeking and reading a single hash, but admittedly even that would almost certainly not actually read the final page over and over again. So I then subtracted 10 to take into account that you would read the final page only once, with 10 being a rough approximation of 12. How many bytes you actually read per read could be affected by all kinds of things - sector size, cluster size, page size, read-ahead, ... - so 10 bits was good enough for me. But you are right, the number of reads would almost certainly be in the vicinity of 20, not 30.
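The arithmetic behind those figures in one place, as a quick back-of-the-envelope check:

```python
import math

n_hashes = 847_223_402
print(math.log2(n_hashes))        # ~29.66 -> the "about 30" naive single-hash reads
print(math.log2(n_hashes) - 10)   # ~19.66 -> roughly 20 reads after discounting ~10 bits per page
```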