
Thread: 64 GB/s aka 16 bytes/cycle universal hashing

  1. #1
    Bulat Ziganshin (Programmer)

    64 GB/s aka 16 bytes/cycle universal hashing

    "universal hash" is a mathematical term that, from practical grounds, guarantees ideal hashing result. it's impossible to check statistical properties of single fixed hash in mathematically sound way, so universal hash by definition is a family of hashes (i.e. parameterized hash). Kaser&Lemire proved that sum(32*32->64) string hash is universal (here "string" means variable-sized input data as opposed to fixed-width M-bit value). In C++, their hashing algorithm looks as the following:

    Code:
    #include <cstdint>
    #include <cstddef>

    // multilinear hash: sum of 32x32->64 products; key[] is the parameter
    uint32_t multilinear_hash(const uint32_t* data, const uint32_t* key, size_t n)
    {
        uint64_t hash = 0;
        for (size_t i = 0; i < n; i++)
            hash += uint64_t(data[i]) * key[i];
        return uint32_t(hash >> 32);
    }
    Here key[] is the algorithm parameter (pseudo-random numbers) and has the same size as the input data. But we can limit key[] to, say, 128 bytes and use a hash tree to combine the partial results (as implemented by the VMAC and UMAC algorithms):

    Code:
    result[0] = hash (data[0..31], key[0..31])
    result[1] = hash (data[32..63], key[0..31])
    ...
    result = hash(result[0..31], key[0..31])
    Alternatively, the last step may employ SHA-256 or any other good hash, since it has much less data to process.
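
    For concreteness, here is a minimal C++ sketch of the two-level scheme (a sketch only: it assumes n is a multiple of 32 dwords and there are at most 32 blocks, so a single combining pass suffices; multilinear_hash is the function from the first snippet):

    Code:
    // Two-level hash tree: per-block 32-bit results are hashed again with
    // the same 128-byte (32-dword) key
    uint32_t tree_hash(const uint32_t* data, const uint32_t* key, size_t n)
    {
        uint32_t partial[32];                  // assumes n/32 <= 32 blocks
        size_t blocks = n / 32;                // assumes n % 32 == 0
        for (size_t b = 0; b < blocks; b++)
            partial[b] = multilinear_hash(data + b * 32, key, 32);
        return multilinear_hash(partial, key, blocks);
    }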

    ------------------------------------------------------------------------------------------------------------------------

    So, the only thing that determines speed is the inner loop summing the results of the 32*32->64 multiplications:

    mov eax, data[i]
    mul dword ptr key[i]   ; edx:eax = data[i] * key[i]
    add ebx, eax           ; 64-bit accumulate in ecx:ebx
    adc ecx, edx

    It should run at 2 cycles per dword on anything since the Pentium Pro, i.e. 8 GB/s on a 4 GHz CPU. The loop speed is limited by the two load operations (data[i] and key[i]), so Haswell, which can issue two loads per cycle, can run it at 1 cycle per dword.
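
    In portable C++ the same accumulation is just a 64-bit sum; as a sketch, one can model the two 32-bit registers explicitly to see where the add/adc pair comes from:

    Code:
    // Models the ecx:ebx accumulator pair from the asm above:
    // lo plays ebx, hi plays ecx, the manual carry stands in for adc
    uint32_t hash_two_regs(const uint32_t* data, const uint32_t* key, size_t n)
    {
        uint32_t lo = 0, hi = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t p = uint64_t(data[i]) * key[i];   // mul: edx:eax
            uint32_t plo = uint32_t(p);
            lo += plo;                                 // add ebx, eax
            hi += uint32_t(p >> 32) + (lo < plo);      // adc ecx, edx
        }
        return hi;                                     // = hash >> 32
    }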

    ------------------------------------------------------------------------------------------------------------------------

    Employing SSE2/AVX2 allows us to compute 2/4 products simultaneously. Combined with Haswell's ability to perform two loads per cycle, it should run at 1 cycle per 8/16 bytes (for SSE2/AVX2 respectively), providing the abovementioned 64 GB/s hashing speed with the following AVX2 code:

    vmovdqu  ymm0, [data+i*32]
    vpmuludq ymm0, ymm0, [key+i*32]     ; products of the even dwords
    vpaddq   ymm1, ymm1, ymm0

    vmovdqu  ymm0, [data+i*32+4]        ; +4 bytes: odd dwords in even lanes
    vpmuludq ymm0, ymm0, [key+i*32+4]
    vpaddq   ymm1, ymm1, ymm0

    i=i+1
    vmovdqu  ymm0, [data+i*32]
    ...
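
    For those who prefer intrinsics, here is a C++ sketch of the same loop (a translation, not measured code). It reaches the odd dwords by shifting the 64-bit lanes instead of using the +4-byte offset loads, which avoids reading past the end of the buffer; n is assumed to be a multiple of 8:

    Code:
    #include <immintrin.h>

    // Sum of 32x32->64 products with AVX2; returns the full 64-bit sum,
    // the caller takes the top 32 bits as in the scalar version
    uint64_t multilinear_avx2(const uint32_t* data, const uint32_t* key, size_t n)
    {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i < n; i += 8) {
            __m256i d = _mm256_loadu_si256((const __m256i*)(data + i));
            __m256i k = _mm256_loadu_si256((const __m256i*)(key + i));
            // vpmuludq multiplies the even 32-bit lanes...
            acc = _mm256_add_epi64(acc, _mm256_mul_epu32(d, k));
            // ...so shift both operands right by 32 to cover the odd lanes
            acc = _mm256_add_epi64(acc, _mm256_mul_epu32(_mm256_srli_epi64(d, 32),
                                                         _mm256_srli_epi64(k, 32)));
        }
        uint64_t lane[4];
        _mm256_storeu_si256((__m256i*)lane, acc);      // horizontal reduction
        return lane[0] + lane[1] + lane[2] + lane[3];
    }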

    ------------------------------------------------------------------------------------------------------------------------

    On pre-Haswell CPUs we are again limited by load throughput: two load operations (data and key) per 16-byte SSE element means 2 cycles. We can reduce the number of loads by processing six 128-byte blocks in parallel (since they all share the same key values):

    movdqu xmm7, [key+i*16]        ; load the key once

    movdqu xmm0, [data+i*16]       ; block 0
    pmuludq xmm0, xmm7
    paddq xmm1, xmm0

    movdqu xmm0, [data+128+i*16]   ; block 1
    pmuludq xmm0, xmm7
    paddq xmm2, xmm0
    ...
    movdqu xmm0, [data+128*5+i*16] ; block 5
    pmuludq xmm0, xmm7
    paddq xmm6, xmm0

    i=i+1
    movdqu xmm7, [key+i*16]        ; load the next key...

    This code performs 7 loads (six data values and one key value) per 6*8 bytes of hashed input, so a 4 GHz Core 2 or later CPU, sustaining one load per cycle, will run at 4*8*6/7 ≈ 27 GB/s using SSE2, saturating its memory bandwidth.
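
    The same idea in intrinsics, as a sketch faithful to the asm above (like the asm, each pmuludq consumes only the even dwords of its 16-byte load; a second pass with shifted lanes, as in the AVX2 sketch, would be needed to cover the odd ones):

    Code:
    #include <emmintrin.h>

    // Six interleaved streams, 128 bytes (32 dwords) apart, sharing one key
    // load per iteration; n is the per-stream dword count, a multiple of 4
    void multilinear_6way(const uint32_t* data, const uint32_t* key,
                          size_t n, uint64_t sums[6])
    {
        __m128i acc[6];
        for (int s = 0; s < 6; s++) acc[s] = _mm_setzero_si128();
        for (size_t i = 0; i < n; i += 4) {
            __m128i k = _mm_loadu_si128((const __m128i*)(key + i));  // one key load
            for (int s = 0; s < 6; s++) {
                const uint32_t* p = data + s * 32 + i;               // s-th stream
                __m128i d = _mm_loadu_si128((const __m128i*)p);
                acc[s] = _mm_add_epi64(acc[s], _mm_mul_epu32(d, k));
            }
        }
        for (int s = 0; s < 6; s++) {
            uint64_t lane[2];
            _mm_storeu_si128((__m128i*)lane, acc[s]);
            sums[s] = lane[0] + lane[1];
        }
    }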

    ------------------------------------------------------------------------------------------------------------------------

    It's important to note that all these code snippets work in x86 (32-bit) mode as well as x64. So, for practical usage we only need a way to generate around 128..4096 bytes of pseudo-random data from a 32..256-bit key provided by the library caller. It can be implemented by applying AES or SHA-256 to the values key, key+1, ... (i.e. using AES in CTR mode), as is already done in the VMAC algorithm.
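
    A sketch of the key expansion in counter mode (splitmix64 here is only a readability placeholder; a real implementation should use AES-CTR or SHA-256 as described above):

    Code:
    // Expands a small seed into the key[] array by hashing seed, seed+1, ...
    // splitmix64 is a stand-in for the cryptographic PRF
    static uint64_t splitmix64(uint64_t x)
    {
        x += 0x9E3779B97F4A7C15ULL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }

    void expand_key(uint64_t seed, uint32_t* key, size_t n_dwords)
    {
        for (size_t i = 0; i < n_dwords; i += 2) {
            uint64_t r = splitmix64(seed + i / 2);  // counter mode: seed, seed+1, ...
            key[i] = uint32_t(r);
            if (i + 1 < n_dwords) key[i + 1] = uint32_t(r >> 32);
        }
    }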

    Strictly speaking, this algorithm returns a 32-bit hash value. For a wider result, we can either compute and then combine multiple 32-bit hashes (the same data[i] multiplied by several key[K][i] values) with a proportional speed penalty, or use some form of mixing of the intermediate data, losing the universal-hash property but keeping the same high speed. I expect that a properly designed mixing step would yield a good practical hash, as is done in CityHashCrc256, for example.
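
    The first option in sketch form (key_bank is hypothetical: K independently generated key arrays; each extra 32 bits of output costs roughly one more pass over the data):

    Code:
    // K independent 32-bit hashes of the same data, using multilinear_hash
    // from the first snippet
    void wide_hash(const uint32_t* data, size_t n,
                   const uint32_t* const key_bank[], int K, uint32_t out[])
    {
        for (int k = 0; k < K; k++)
            out[k] = multilinear_hash(data, key_bank[k], n);
    }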

    ------------------------------------------------------------------------------------------------------------------------

    Conclusion: these code snippets may be used to implement an alternative to xxHash/CityHash with the following properties:
    • fast in both 32-bit and 64-bit code:
      • at least 2 bytes/cycle on any CPU since the Pentium Pro
      • at least 7 bytes/cycle on any Intel CPU since Core 2
      • 16 bytes/cycle on Haswell, 32 bytes/cycle on Skylake with AVX-512
    • 32-bit result by default; wider results with a proportional speed penalty
    • almost math-proven guarantees of hash quality (the real mathematical guarantee requires a truly random key)
    • wider results at the same speed as the 32-bit one, with no slowdown: needs further development and carries no mathematical guarantees
    • key generation and result combining need good-quality algorithms, but I believe AES and SHA-256 can fill this duty
    • the algorithm is parameterized, meaning that you can calculate as many independent hashes as you need

    So, comparing it to the best non-crypto hashes I know of, xxHash/CityHash, I expect this hash to have better speed but the following drawbacks:
    • much more compiled code, since we need to include AES and SHA-256 implementations
    • higher cache occupancy (those 128..4096 bytes of key[])
    • wide results (>32 bits) without a speed penalty still need further development
    Last edited by Bulat Ziganshin; 9th April 2014 at 18:22.


  2. #2
    Matt Mahoney (Expert)
    Doesn't limiting the key destroy the universal property of the hash? For example, if you swap data[i] with data[i+32], you have a collision. Likewise if you have data[i]+C, data[i+32]-C.
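
    A small sketch of that collision, under the reading that the 128-byte key is simply repeated across the input with no keyed combining stage:

    Code:
    // With a repeated key, data[i] and data[i+32] share key[i % 32]:
    // swapping them leaves the sum unchanged, and data[i]+C, data[i+32]-C
    // shifts it by C*key[i%32] - C*key[i%32] = 0
    uint64_t repeated_key_hash(const uint32_t* data, const uint32_t* key32, size_t n)
    {
        uint64_t h = 0;
        for (size_t i = 0; i < n; i++)
            h += uint64_t(data[i]) * key32[i % 32];
        return h;
    }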

  3. #3
    nburns (Member)
    I have a feeling that if the paper calls for the same number of random key bits as input bits, that's probably critical to making it work.

    Why don't you email Lemire and describe the situation? Maybe he will give you a recommendation.
    Last edited by nburns; 10th April 2014 at 01:51.

