
It's not exactly "losing information". We are not trying to regenerate original data, just make sure that all source bits can fairly influence the result. It's more a question of bit contribution.

In the resulting 64-bit register, bit-0 can only be contributed by bit-0 of each 32-bit input, so it's only representative of those 2 bits. The same holds at the other end: bit-63 depends mostly on bit-31 of each 32-bit input, plus carries from lower bits (making things more complex).

In the middle, many more bits participate, so that's where they are "well mixed".
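A scalar sketch (not XXH3's actual code) makes the contribution pattern concrete: bit 0 of a 32x32=>64 product is just the AND of the two inputs' bit 0, and no higher input bit can reach it.

```c
#include <stdint.h>
#include <assert.h>

/* Exact 32x32->64 product. */
static uint64_t mul32to64(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Returns 1 if flipping the given bit of `a` changes bit 0 of the
 * product; for any bit > 0 the answer is always 0, since the
 * difference between the two products is (1 << bit) * b, an even
 * number that cannot disturb bit 0. */
static int bit_affects_product_bit0(uint32_t a, uint32_t b, int bit) {
    uint64_t p1 = mul32to64(a, b);
    uint64_t p2 = mul32to64(a ^ (1u << bit), b);
    return (int)((p1 ^ p2) & 1u);
}
```

So only 2 of the 64 source bits ever influence bit 0 of the product, which is the "bit contribution" imbalance described above.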

This question of bit distribution would become critical if the 64-bit accumulator were used "as is" to produce a hash, but fortunately it's not. Accumulators get mixed before that stage. When the mixing function is done correctly, every bit gets redistributed to the point of saturation. After that point, it does not matter that one accumulator bit initially had fewer contributions than another: all source bits contribute fully. This can be easily verified with SMHasher's Avalanche test.

Finally, XXH3 does not follow UMAC too closely, and adds an operation which ensures that all bits are necessarily present at least once in the accumulator. This compensates for the risk of multiplying by zero, which _is_ dangerous for checksumming, as it would nullify a contribution.



> It's not exactly "losing information". We are not trying to regenerate original data, just make sure that all source bits can fairly influence the result. It's more a question of bit contribution.

I agree. But entropy is a great concept that helps us calculate whether or not "input bits" can affect "output bits".

Let's look at XXH3_scrambleAcc:

    __m256i const k   = _mm256_loadu_si256 (xkey+i);
    __m256i const dk  = _mm256_mul_epu32 (data,k);
    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31);
    __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
    xacc[i] = _mm256_xor_si256(dk, dk2);
Assume you start with 512-bits of input entropy (that is: assume that all 512-bits of state are uniformly random). Will you have 512-bits of uniformly random output after the above operations? Or to put it in your words: are you 100% sure that the above steps allow every input bit to affect every output bit?

It appears not. My counter-example (not that I've debugged your code... but based on my reading) is as follows: key[2] is 0x7c01812c and key[3] is 0xf721ad1c, which means key[2] AND key[3] are both even numbers.

Which means dk.int64[1] bit0 will always be zero, and dk2.int64[1] bit0 will ALSO always be zero (on iteration 0, i=0).

I bet you, at least... if I did my math correctly... that xacc[0].int64[1].bit0 will ALWAYS be zero, no matter what state you start with (bit#64 of xacc[0] == 0)

This is because xacc[0].int64[1].bit0 = dk.int64[1].bit0 XOR dk2.int64[1].bit0

And both of those component parts seem to always be zero. Which means you're at LEAST wasting the 64th bit of state each time you perform XXH3_scrambleAcc. It certainly seems like a weakness of the XXH3_scrambleAcc function to me.

That's a lot of operations you do, to literally throw away half your bits. There's probably an optimization you can do here to get better mixing.
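The claim is easy to check in scalar form (a sketch of my reading of the vector code, not the code itself): with both multipliers even, bit 0 of each 64-bit product is forced to zero, and so is their XOR.

```c
#include <stdint.h>
#include <assert.h>

/* Bit 0 of (dk ^ dk2) for one 64-bit accumulator lane, i.e. bit 64
 * of xacc[0] in the vector code. With even k and k2, both products
 * are even, so this is always 0 regardless of the data words. */
static uint64_t lane_bit0(uint32_t d, uint32_t d2, uint32_t k, uint32_t k2) {
    uint64_t dk  = (uint64_t)d  * k;
    uint64_t dk2 = (uint64_t)d2 * k2;
    return (dk ^ dk2) & 1u;
}
```

With the even key words quoted above (0x7c01812c, 0xf721ad1c) the result is zero for any input; making either multiplier odd immediately restores the bit.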


Well, if you believe a better scrambling operation is possible, you are certainly welcome to suggest one. Considering feedback on the algorithm is one of the objectives of the test phase. If the issue is about the default keys, it's also possible to change them, though one will also have to consider what happens with custom keys and whether that implies enforcing some conditions (custom keys are a long-term objective of XXH3). In the end, XXH3 only generates 64- and 128-bit hashes, and maybe 256-bit in the future, so compression necessarily happens somewhere. I believe the issue would be more pressing if the intention were to generate a 512-bit hash, but that has never been an objective.


HighwayHash uses PSHUFB to spread entropy from twin 32x32=64 muls across 128-bit vectors [or vector halves, on AVX2]. See ZipperMerge in https://github.com/google/highwayhash/blob/master/highwayhas....


> Well, if you believe a better scrambling operation is possible, you are certainly welcomed to suggest one.

Oh, criticism of the algorithm is the easy part, especially when I'm being vague and can't think of solid proof of my suppositions. Coming up with a better scrambling operation is the hard part. :-)

As I stated earlier, there are no obvious "red flags" in your algorithm. It's just this obscure part of the function that makes me think that optimization potential exists.

Since you only have 256 bits of entropy in your 512-bit accumulator, my goal, if I were to go through your code, would be to cut the state down to 256 bits while keeping 256 bits of entropy.

1. All multiplication keys should be odd (bottom bit set to 1). This ensures that all multiplication steps that keep only the bottom 32 bits retain all of their entropy.

2. Precisely throw away the top 32 bits of your multiplications and keep only the bottom 32 bits. This will be slower, but it ensures full entropy when combined with odd keys.

3. Now that we have full entropy on the multiplication steps, cut the 512-bit accumulator down to 256 bits, which should improve performance on AMD Zen / ARM NEON. It may help Intel Skylake, but Skylake is so wide that it's hard to get enough ILP.
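A small scalar experiment illustrates points 1 and 2, scaled down to 16-bit products so the whole input space can be enumerated: an odd multiplier makes "keep the low bits" a bijection, while an even one collapses half the states.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Count distinct values of (x * k) mod 2^16 over all 16-bit inputs.
 * Odd k: a bijection, 65536 distinct outputs, no entropy lost.
 * Even k: every product is even, so at most 32768 outputs survive. */
static int distinct_low16_products(uint16_t k) {
    static uint8_t seen[65536];
    memset(seen, 0, sizeof seen);
    int n = 0;
    for (uint32_t x = 0; x < 65536; x++) {
        uint16_t y = (uint16_t)(x * k);
        if (!seen[y]) { seen[y] = 1; n++; }
    }
    return n;
}
```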

--------------------

Without testing the code, here's an idea.

    __m256i const k   = _mm256_loadu_si256 (xkey+i);
    __m256i const dk  = _mm256_mul_epu32 (data,k);
    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31); // <---- This line can be precomputed by the way...
    __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
    // Above this line is the same code as before
    __m256i const dk2s = _mm256_shuffle_epi32 (dk2,0xc4); // dk2 is declared const above, so use a new name
    xacc[i] = _mm256_blend_epi32(dk, dk2s, 0xaa);
As long as the keys are all odd (bottom bit set to 1), the above should mix the bits without losing any entropy. So you can now have a 256-bit state.

----------

The accumulate portion would be simply:

    __m256i const add = _mm256_xor_si256(d, xacc);  // _mm256_xor_epi64 is AVX-512 only; AVX2 uses _mm256_xor_si256
    xacc = _mm256_xor_si256(res, add);              // res: the multiply result from the step above
XOR (or add) captures all the entropy from the input. In my experiments, XOR works better with multiply (I don't know why, it's just something I've noticed). But otherwise, it's conceptually similar to your original "add" code.

----------

I mean, I'm just shooting from the hip here. I don't know how well the above code mixes things around. But those are kind of the steps that I would take to move forward.

----------

Although, if you want me to be 100% honest, anything I'd do "for real" would revolve around _mm_aesenc_si128. That'd be

    xacc[0] = _mm_aesenc_si128(xacc[0], data[0]);
    xacc[1] = _mm_aesdec_si128(xacc[1], data[1]);
    
    // Finalization:
    __m128i finalizationA = _mm_aesenc_si128(xacc[0], const1);
    finalizationA = _mm_aesenc_si128(finalizationA , const2);
    finalizationA = _mm_aesenc_si128(finalizationA , const3);
    finalizationA = _mm_aesenc_si128(finalizationA , const4);
   
    __m128i finalizationB = _mm_aesdec_si128(xacc[1], const1);
    finalizationB = _mm_aesdec_si128(finalizationB , const2);
    finalizationB = _mm_aesdec_si128(finalizationB , const3);
    finalizationB = _mm_aesdec_si128(finalizationB , const4);

    __m128i finalHash = _mm_xor_si128(finalizationA, finalizationB); // SSE2 XOR (there is no _mm128_xor_epi64)

    // Return bottom 32, 64, or 128-bits of finalHash
 
    // 4 rounds of finalization should be enough. 
    // Testing required to find the minimum for an avalanche condition. It will be more than 2, but probably less than 4...
And... the end. That's it. It only uses the 128-bit half of the execution units, but still gets ILP from two parallel instances. So 256 bits of state. It's probably a fast function, but I can't be bothered to test it right now. I'm curious whether the simpler code above would outrun the 256-bit execution units on Skylake, but I don't have a Skylake box to test on (just AMD Zen).

With AVX512, _mm512_aesenc_epi128 makes it possible to do this over 512 bits (4x 128-bit parallel instances of AES encryption). But I don't have a Skylake-X machine to test this idea with.

See here for my experiment on AES functionality: https://github.com/dragontamer/AESRand


Thanks for the suggestions @dragontamer. We are genuine when saying the algorithm is open to suggestions, and it can still change to improve its properties. Let's review yours:

> my goal if I were to go through your code is to cut the state down to 256-bits, while keeping 256-bits of entropy.

The point is, it's a lot more difficult to keep the registers "fully loaded" at maximum entropy _at every step_. By going double size, and accepting that this entropy is leniently dispersed into the larger register, we make our lives a lot easier, which translates into noticeable speed gains. Details below.

> All multiplication keys should be odd (bottom bit set to 1). This ensures that all multiplication steps that only keep the 32-bottom bits will keep all of their entropy.

One core idea of UMAC, which XXH3 is based upon, is that the keys can be any random number (they are supposed to be secret). Forcing them to be odd reduces the available space by one bit. Not a very big deal, but still. This could also be ensured by adding an `OR 1` operation on loading the key.

> Precisely throw away the top 32 bits of your multiplications and keep only the bottom 32 bits. This will be slower, but it ensures full entropy when combined with odd keys.

OK, so that's where the notion of "bit contribution" becomes useful. By making a 32x32=>32 multiplication and ensuring the multiplier is odd, you have correctly mixed the lowest bit. But the other bits do not contribute to lower bits. At the other extreme, the highest input bit only contributes to one bit (the highest) in the resulting product. It's clearly not well mixed.
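For what it's worth, the top-bit behaviour is easy to check in scalar code (a sketch, with arbitrary odd multipliers): flipping bit 31 of the input flips only bit 31 of a 32x32=>32 product, because the contribution k << 31 reduces to 1 << 31 mod 2^32.

```c
#include <stdint.h>
#include <assert.h>

/* XOR difference between the products of x and (x with bit 31
 * flipped), for an odd multiplier k. The difference is always
 * exactly 0x80000000: the top input bit reaches only the top
 * output bit in a truncated 32-bit multiply. */
static uint32_t product_diff_top(uint32_t x, uint32_t k) {
    return (x * k) ^ ((x ^ 0x80000000u) * k);
}
```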

This can be compensated with a rotation, or a right shift + add, followed by another spreading step (another multiplication), and another right shift + add. But all this adds quite a few operations, right in the critical loop, so it's no longer the same speed.

> This line can be precomputed by the way...

That's a great point! It transforms the shuffle into a load; it's not completely free but is probably faster. More importantly, it requires memory to store the swapped table of keys, which can be troublesome if the size of the table of keys can be customized. I'll look into it, thanks for the suggestion!

> XOR (or add) captures all the entropy from the input. In my experiments, XOR works better with multiply

There are 2 parts here:

- It's not possible to XOR `d` with `res` (which is what happens transitively in your proposal). The whole point of adding `d` is that it avoids cancelling a contributor, which can happen if there is a multiplication by zero. With XOR, the same impact can still happen, but it requires a multiplication by 1 instead: `(1*d)^d = 0`. Same problem if `d` is subtracted. But with an add operation, cancelling the `d` contribution requires a multiplication by `-1`, and that is precisely impossible when doing a 32x32=>64 multiplication. Unfortunately, when doing a 32x32=>32 multiplication, it becomes possible: -1 == 0xFFFFFFFF. So the addition no longer saves the situation, and it becomes necessary to change the formula.

- XORing `res` with `acc` seems more tractable. I tried it in the early days of XXH3, but unfortunately it proved worse in terms of hash quality. At this stage, I'm not too sure why. Maybe later changes indirectly solved an underlying issue, so it might be worth trying again to see if it proves any better.
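The 32x32=>32 cancellation is easy to confirm with a scalar sketch: k == 0xFFFFFFFF acts as -1 mod 2^32, so k*d + d == d*(k+1) wraps to zero for every d, and the protective addition no longer helps.

```c
#include <stdint.h>
#include <assert.h>

/* One accumulation step with truncated 32-bit multiply plus the
 * protective add. For k == 0xFFFFFFFF (i.e. -1 mod 2^32) this is
 * d * (k + 1) == d * 0 == 0: the contribution of d is nullified. */
static uint32_t acc_step32(uint32_t d, uint32_t k) {
    return d * k + d;
}
```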

> to be 100% honest, anything I'd do "for real" would revolve around _mm_aesenc_si128

I fully agree. Leveraging dedicated hardware capabilities is most likely efficient, and AES does a great job at mixing bits, which is difficult to emulate with simpler instructions. There are only 2 minor issues to be aware of:

- It relies on the presence of a hardware AES module. While not an issue when the target platform basically guarantees its presence, it's a non-trivial problem when targeting broader portability. Platforms without AES, or without access to it (yes, even on Intel, some "systems" can't access the AES instructions, think `WASM` or kernel space for example), will pay a hefty performance price using a software fallback. It's not necessarily a killer issue, just something to be aware of. xxHash tries to target very broad portability. This is a "handicap", but one with its own benefits.

- AES instructions have excellent throughput, but non-negligible latency. This is especially true on pre-Skylake CPUs. Latency is hard to measure, so it's frequently forgotten in benchmarks. In my own tests (on latest-generation Intel CPUs, so very favorable to AES), using AES instructions ends up in the 80M/s region when measuring latency, which is not bad. Compare with XXH3, which ends up in the 110-150M/s region.

Don't get me wrong, using AES is likely a good choice. The only reason XXH3 doesn't use it is that it targets very broad portability, including targets without a hardware AES module.


> OK, so that's where the notion of "bit contribution" becomes useful. By making a 32x32=>32 multiplication and ensuring the multiplier is odd, you have correctly mixed the lowest bit. But the other bits do not contribute to lower bits. At the other extreme, the highest input bit only contributes to one bit (the highest) in the resulting product. It's clearly not well mixed.

True, but this is also true for the top 16 bits and bottom 16 bits of a 64-bit multiply. It's a problem innate to multiplication: the "middle" bit (bit#31) will be best mixed, while the "edge" bits (bit#0 or bit#63) will be awful.

> This can be compensated with a rotation, or a right shift + add, followed by another spreading step (another multiplication), and another right shift + add. But all this adds quite a few operations, right in the critical loop, so it's no longer the same speed.

There are quite a few other ways to compensate that remain efficient. vpshufb has 1-cycle latency and once-per-clock throughput. If this were ARM, an NVidia GPU, or an AMD GPU, bit-reversal would work (RBIT on ARM).

But since this is Intel, the fastest way to spread bits around is to do shuffle(state, [0,1,2,3]) (my shorthand for _mm256_shuffle_epi32 (k,0x1b)).

In general, you just need to map the "strongly mixed" bits (bit#31) to the locations that will potentially affect the most bits (bit#0 in the next multiplication). Bit-reversal is best, but almost any vpshufb should do the trick.

> The whole point of adding d is that it avoids cancelling a contributor, which can happen if there is a multiplication by zero

"Cancelling a contributor" is bad, but I strongly disagree with your characterization of it!

"Cancellation of a contributor" only happens when two inputs map to the same output. By the pigeonhole principle, something MUST map to zero. Otherwise, the "zero" output is wasted.

You've identified the singular inputs that result in zero in various ways on 32-bit -> 32-bit multiplication. That's a good thing!! There is ONLY ONE input that maps to zero in my proposed multiplication.

Again: it's not so important that "nothing maps to zero". It's far more important that "exactly one thing maps to zero".

----------

In your algorithm: if d[0] and d[1] are zero, you get a zero as a result (d[0] * k[0] + d[1] * k[1] == dk[0] == 0 when d[0] == d[1] == 0).

But... (d[0] * k[0] + d[1] * k[1] == 0) defines a whole slew of "potential mappings to zero". It's non-trivial to analyze this, especially with overflow (the "zero" on the right-hand side is 64-bit unsigned with overflow).

In a 16-bit case, 0xffff * 0xfffc (d[0] == 0xffff, k[0] == 0xfffc) will be "inverted" by d[1] == 163838, k[1] == 2.

So in this case, we have a collision: d[0] == d[1] == 0 maps to zero, AND d[0] == 0xffff, k[0] == 0xfffc, d[1] == 163838, k[1] == 2 maps to zero as well.
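These two zero-mappings can be checked directly in 32-bit unsigned arithmetic (a sketch; the sum is taken mod 2^32 by wraparound, as discussed above):

```c
#include <stdint.h>
#include <assert.h>

/* Two-lane combination d0*k0 + d1*k1, reduced mod 2^32 by normal
 * unsigned wraparound. Distinct input tuples can land on the same
 * output: 0xffff * 0xfffc + 163838 * 2 == 2^32 == 0 mod 2^32. */
static uint32_t two_lane_sum(uint32_t d0, uint32_t k0, uint32_t d1, uint32_t k1) {
    return d0 * k0 + d1 * k1;
}
```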

That's how you "lose bits": multiple inputs potentially mapping to the same output (in this case zero, but in general any output).

With regards to multiplication by an odd number, someone already proved it's reversible: https://lemire.me/blog/2017/09/18/computing-the-inverse-of-o...

Which means each input maps to a distinct output for all odd multipliers.
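The inverse from the linked post can be sketched with Newton's iteration: each step doubles the number of correct low bits, so 4 steps from a 3-bit-correct seed (k itself, since k*k == 1 mod 8 for odd k) cover all 32 bits.

```c
#include <stdint.h>
#include <assert.h>

/* Inverse of an odd 32-bit integer mod 2^32 via Newton's iteration,
 * following the construction in the Lemire post linked above.
 * Precondition: k must be odd (even numbers have no inverse). */
static uint32_t inverse_mod_2_32(uint32_t k) {
    uint32_t inv = k;            /* correct to 3 bits: k*k == 1 mod 8 */
    for (int i = 0; i < 4; i++)  /* 3 -> 6 -> 12 -> 24 -> 48 bits */
        inv *= 2u - k * inv;
    return inv;
}
```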

> Don't read me wrong, using AES is likely a good choice. The only reason XXH3 doesn't use it is that it targets very broad portability, including targets without a hardware AES module.

Hmmmm... in any case, a 32-bit multiply is certainly more portable. I think the throughput / latency issue can easily be solved by increasing the state: 5x128 bits of state to unroll the inner loop should be sufficient (5 cycles of latency at once-per-cycle throughput == a 640-bit state should be best).

And you have a point about portability. But note that AES instructions are implemented on ARM, POWER9, and Intel machines now. So it's surprisingly portable, even if systems as a whole (e.g. WASM, or maybe the Java Virtual Machine) don't necessarily support it yet.


Following your suggestion, I went ahead and modified the scrambling formula. The crux of the problem is that the secret is allowed to be absolutely anything (it's user-provided). As a consequence, it's a bad idea to depend on it for the multiplier. Now the secret key's only function is to be a mask, while the multiplier is under implementation control, guaranteeing a good prime.
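In scalar form, the revised step has roughly this shape per 64-bit lane (a sketch: the shift amount and prime below are illustrative, not necessarily the final constants):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the revised scramble: the secret lane only masks (XOR),
 * while the multiplier is a fixed odd prime chosen by the
 * implementation, so it is invertible mod 2^64 and every step is a
 * bijection of the accumulator. Constants here are illustrative. */
static uint64_t scramble_lane(uint64_t acc, uint64_t secret) {
    acc ^= acc >> 47;      /* fold high bits down into low bits */
    acc ^= secret;         /* secret acts purely as a mask */
    acc *= 0x9E3779B1ULL;  /* implementation-controlled odd prime */
    return acc;
}
```

Because XOR-with-shift and odd multiplication are both invertible, no two accumulator values can collide within this step, regardless of the secret.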



