# Thread: 64 GB/s aka 16 bytes/cycle universal hashing

1. ## 64 GB/s aka 16 bytes/cycle universal hashing

"universal hash" is a mathematical term that, from practical grounds, guarantees ideal hashing result. it's impossible to check statistical properties of single fixed hash in mathematically sound way, so universal hash by definition is a family of hashes (i.e. parameterized hash). Kaser&Lemire proved that sum(32*32->64) string hash is universal (here "string" means variable-sized input data as opposed to fixed-width M-bit value). In C++, their hashing algorithm looks as the following:

Code:
```uint32 data[N]
uint32 key[N]
uint64 hash = 0
for i=0..N-1
hash += uint64(data[i]) * key[i]
uint32 result = hash >> 32```
Here key[] is the algorithm parameter (pseudo-random numbers) that has the same size as input data. But we can limit key[] to, say, 128 bytes and use the hash tree to combine partial results (as implemented by VMAC and UMAC algos):

Code:
```result = hash (data[0..31], key[0..31])
result = hash (data[32..63], key[0..31])
...
result = hash(result[0..31], key[0..31])```
Alternatively, the last step may employ the sha256 or any other good hash since it will have much less data to process.

------------------------------------------------------------------------------------------------------------------------

So, the only thing that determines speed is the cycle summing results of 32*32->64 multiplications:

mov eax, data[i]
mul eax, key[i]
add ebx, eax
adc ecx, edx

it should run in 2 cycles/dword since Pentium Pro, i.e. 8 GB/s for 4 GHz CPU. The cycle speed is limited by those 2 load operations (data[i] and key[i]), so Haswell can run it at 1 cycle/dword speed

------------------------------------------------------------------------------------------------------------------------

Employing SSE2/AVX2 allows to compute 2/4 values simultaneously. Combined with 2 loads per cycle Haswell ability, it should run at 1 cycle per 8/16 bytes (for SSE2/AVX2, respectively), providing abovementioned 64 GB/s hashing speed for the following AVX2 code:

mov ymm0, [data+i*32]
vpmuludq ymm0, [key+i*32]
vpaddq ymm1, ymm0

mov ymm0, [data+i*32+4]
vpmuludq ymm0, [key+i*32+4]
vpaddq ymm1, ymm0

i=i+1
mov ymm0, [data+i*32]
...

------------------------------------------------------------------------------------------------------------------------

On pre-Haskell cpus we are again limited by load throughput - 2 load operations (data and key) per 16-byte SSE element means 2 cycles. We can reduce amount of loads by parallel processing of six 128-byte blocks (since they are sharing the same key values):

mov xmm7, [key+i*16] ;load key once

mov xmm0, [data+i*16]
pmuludq xmm0, xmm7
paddq xmm1, xmm0

mov xmm0, [data+128+i*16]
pmuludq xmm0, xmm7
paddq xmm2, xmm0
...
mov xmm0, [data+128*5+i*16]
pmuludq xmm0, xmm7
paddq xmm6, xmm0

i=i+1
mov xmm7, [key+i*16] ;load the next key...

this code requires 7 loads (6 data values and one key value) per 6*8 bytes of input data, so 4 GHz Core2+ CPU will run at 4*8*6/7 = 27 GB/s using SSE2, saturating its memory bandwidth

------------------------------------------------------------------------------------------------------------------------

It's important to note that all these code snippets will work in x86 (32-bit) mode as well as x64. So, for practical usage we only need the way to generate somewhat around 128..4096 bytes of pseudo-random data from 32..256 bit key provided by the library caller. It can be implemented by calculating AES or SHA256 function from the key, key+1... values (i.e. using AES in CTR mode), again already implemented by the VMAC algo

Strictly speaking, this algorithm returns 32-bit hash value. For wider result, we can either calculate and later combine multiple 32-bit hashes (the same data[i] multiplied to several key[K][i] values) with the proportional speed degradation or use some form of mixing up intermediate data losing universal hash property but keeping the same high speed. I expect that properly designed mixing step would provide us with good practical hash like it is done in the CityHashCrc256 for example

------------------------------------------------------------------------------------------------------------------------

Conclusion: these code snippets may be used to implement alternative to xxHash/CityHash with the following properties:
• fast in both 32-bit and 64-bit code:
• __ 2 bytes/cycle at least on any CPU since Pentium Pro
• __ 7 bytes/cycle at least on any Intel CPU since Core2
• __ 16 bytes/cycle on Haswell, 32 bytes/cycle on Skylake with AVX-512
• 32-bit result by default, wider results with proportional speed degradation
• almost math-proven guarantees of hash quality (real math guarantees require true random key)
• wider results with the same speed as 32-bit one without speed degradation - needs further development and no math guarantees anymore
• key generation and results combining needs some good-quality algos, but i believe that AES and SHA256 can run this duty
• the algorithm is parameterized meaning that you can calculate as much independent hashes as you need

So, comparing it to the best non-crypto hashes i know, xxHash/CityHash, i expect this hash to have better speed but the following drawbacks:
• much more compiled code since we need to employ AES and SHA256 implementations
• higher cache occupation (those 128..4096 bytes of key[])
• wide results (>32 bits) without speed degradation needs further development 2. ## The Following 3 Users Say Thank You to Bulat Ziganshin For This Useful Post:

Cyan (9th April 2014),Garen (10th April 2014),Paul W. (10th April 2014)

3. Doesn't limiting the key destroy the universal property of the hash? For example, if you swap data[i] with data[i+32], you have a collision. Likewise if you have data[i]+C, data[i+32]-C. 4. I have a feeling that if the paper calls for the same number of random key bits as input bits, that's probably critical to making it work.

Why don't you email Lemire and describe the situation? Maybe he will give you a recommendation. #### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•