
Thread: Smac-Hash: 64 bits to 32 bits, non-cryptographic

  #1 Jean-Marie Barone:

    I have just created a hash formula for the Match model of Smac 1.16.
    As I'm a bit lazy, I simply used two SSE instructions, rcpps & sqrtsd.
    Surprisingly, my tests on the Silesia corpus show that, overall, it outperforms FNV-1 in terms of dispersion, though not on every file.
    I say 'surprisingly' because calculating the square roots of negative values should have produced a lot of Indefinite (NaN) results, and thus as many collisions. That is a mystery to be solved...
    [Attached image: Silesia test.png (dispersion results on the Silesia corpus)]
    It seems to give better results on text files: on enwik8, I achieved 21836249, against 21871800 with FNV-1.


    ;Smac-hash, 64 bits to 32 bits
    ;We assume the low qword of xmm0 contains the 8 bytes to be hashed

    rcpps xmm1,xmm0   ;xmm1=reciprocal estimates of the packed singles in xmm0
    sub esp,8         ;allocate a qword on the stack
    sqrtsd xmm1,xmm1  ;xmm1=square root of the low qword of xmm1, taken as a double
    movq [esp],xmm1   ;store the result on the stack
    pop eax           ;low dword into eax...
    pop ebx           ;...high dword into ebx
    xor eax,ebx       ;eax=32-bit hash


    This second version inserts a shift-right instruction before the square root is taken: the sign bit is cleared, so xmm1 will always be positive. It gets a better result on enwik8 (2183473), but I have not tested it thoroughly yet.

    ;Smac-hash, 64 bits to 32 bits
    ;We assume the low qword of xmm0 contains the 8 bytes to be hashed

    rcpps xmm1,xmm0   ;xmm1=reciprocal estimates of the packed singles in xmm0
    sub esp,8         ;allocate a qword on the stack
    psrlq xmm1,1      ;shift right to clear the sign bit
    sqrtsd xmm1,xmm1  ;xmm1=square root of the low qword of xmm1
    movq [esp],xmm1   ;store the result on the stack
    pop eax           ;low dword into eax...
    pop ebx           ;...high dword into ebx
    xor eax,ebx       ;eax=32-bit hash


    Also, you could use sqrtpd to work on a 128-bit value; you might replace rcpps xmm1,xmm0 with rcpps xmm1,dqword [Source].
    Feel free to use and modify...
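
    For C programmers, here is a minimal intrinsics sketch of the second (psrlq) version, assuming x86-64; the function name smac_hash64 is illustrative, not from the post, and since rcpps only computes estimates, the output may differ across CPUs (see the discussion below).

    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2; also pulls in SSE for _mm_rcp_ps */

    static uint32_t smac_hash64(uint64_t key)
    {
        __m128i x = _mm_cvtsi64_si128((long long)key);      /* low qword = the 8 input bytes */
        __m128  r = _mm_rcp_ps(_mm_castsi128_ps(x));        /* rcpps: packed reciprocal estimates */
        __m128i s = _mm_srli_epi64(_mm_castps_si128(r), 1); /* psrlq: clear the sign bit */
        __m128d q = _mm_sqrt_sd(_mm_setzero_pd(),
                                _mm_castsi128_pd(s));       /* sqrtsd on the low qword */
        uint64_t h = (uint64_t)_mm_cvtsi128_si64(_mm_castpd_si128(q));
        return (uint32_t)h ^ (uint32_t)(h >> 32);           /* xor low and high dwords */
    }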

  #2 Member from Kraków, Poland:
    I doubt rcpps & sqrtsd will give repeatable results on different processors, or even with different compilers and compilation settings.

    I think FNV-1 is good enough for a start. If you want something more sophisticated, look for the recent TarsaLZP topic: Shelwien made a variation that uses his rolling CRC32 algorithm, which should fit nicely with a match model.

  #3 Jean-Marie Barone:
    After further testing, the first version should be avoided: the square-root function issues the same NaN value for every negative entry. The second version corrects this problem, though, and looks OK.
    I had originally used FNV-1 for my Match model, and I made some tests with CRC32 and an S-Box hash to see if I could achieve better results, but FNV-1 remained better.
    Curiously, this weird hash is getting better results, but that may be due to the way my Match model is implemented; I have to sort this out.

  #4 nburns:
    I'm working on a hashing project at the moment. I'm not sure how to interpret your results, such as on enwik8. I haven't checked out your hash function yet, and I haven't really looked much at work other people have done with hashing, especially applied to full text. I'm trying it my way, but I really don't know what other techniques exist, so I don't know how to compare.

    I tracked down the formula to calculate the expected number of hash collisions, under the assumption that hashes are random and evenly distributed. Here's how I wrote it in C:


    double D = hashtablesize;  /* number of possible hash values */
    double n = trials;         /* number of objects hashed */
    double phi = (D - 1.0) / D;
    double expected_collisions = n - D * (1.0 - pow(phi, n));  /* pow() requires <math.h> */


    D is the number of possible hash values. n is the number of objects being hashed. A collision is counted each time you compute a hash for some object and find that value already assigned to a different object.
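
    To make the formula concrete, here is a self-contained version of that snippet; the table size and trial count are arbitrary example values, not numbers from my tests.

    #include <math.h>
    #include <stdio.h>

    /* Expected number of collisions when n random hashes fall into D slots. */
    static double expected_collisions(double D, double n)
    {
        double phi = (D - 1.0) / D;
        return n - D * (1.0 - pow(phi, n));
    }

    int main(void)
    {
        double D = 16777216.0;  /* 2^24 possible hash values (example) */
        double n = 1000000.0;   /* one million objects hashed (example) */
        /* When n is much smaller than D, this approaches the birthday
           approximation n*n/(2*D). */
        printf("expected collisions: %.1f\n", expected_collisions(D, n));
        return 0;
    }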

    My experimental results match the numbers I get from this formula with high precision.

    I think it would provide useful context to report hashing results as the number of collisions observed, compared with the number expected for a random assignment of hashes.

  #5 Jean-Marie Barone:
    Thanks nburns, this might be helpful. My own testing method was much more basic: I tested the hash on a different set of files and compared with the FNV-1 results. I don't really plan to analyze it in depth; I just noticed it was working well for me, especially with text files, so I intend to use it.

    I have read some documentation regarding floating-point determinism.
    It turns out using the rcpps instruction was a bad idea, as it is not fully IEEE-754 compliant. Actually, at least four instructions must be avoided: rcpps, rcpss, rsqrtps and rsqrtss, because they only calculate estimates and might give different results on different processors. Someone on a forum (I can't find the link again) had tested rcpps on different processors, and the results were sometimes a bit different, even between an Intel Core 2 and a Core 2 Duo. Using the square root is OK, though.
    So I replaced the rcpps instruction in my Match-model hash with something like this:

    ;Reset the SSE2 control/status register (set rounding mode to Nearest)
    push eax                 ;allocate a dword on the stack
    stmxcsr [esp]            ;store the MXCSR register on the stack
    mov word ptr [esp],1F80H ;default settings (round to nearest, exceptions masked)
    ldmxcsr [esp]            ;reload MXCSR from the stack
    pop eax                  ;clean up the stack

    ;Initialize xmm2 with a mask
    pcmpeqd xmm2,xmm2        ;fill xmm2 with 1-bits
    pslld xmm2,8             ;each dword of xmm2 = FFFFFF00H

    ;Hash computation, assuming the data to hash is in the low qword of xmm0
    cvtdq2ps xmm1,xmm0       ;xmm1=the 4 dwords of xmm0 converted to floats
    pand xmm1,xmm2           ;zero the low byte of each dword
    sub esp,8                ;allocate a qword on the stack
    psrlq xmm1,1             ;shift right to clear the sign bit
    sqrtsd xmm1,xmm1         ;xmm1=square root of the low qword of xmm1
    movq [esp],xmm1          ;store the result on the stack
    pop eax                  ;low dword into eax...
    pop ebx                  ;...high dword into ebx
    xor eax,ebx              ;eax=32-bit hash


    Results on enwik8 (21833024) are a bit better than with the second version, but results on Silesia are a bit worse, although still better than FNV-1.
    Anyway, I still have the feeling that something useful can be worked out using a small chain of SSE instructions.

  #6 nburns:
    The kind of hash I'm playing with is an extension of a rolling hash in which bytes can be added and removed in any order: you can add as many bytes as you want, remove some, and add more, in any order.

    It's based on the Randomized Karp-Rabin implementation by Daniel Lemire at http://code.google.com/p/ngramhashing/. However, it's simpler, because I started eliminating anything unnecessary and pared it back to a single C header. I made changes to the algorithm, so it's not really Randomized Karp-Rabin any more.


    #include <stdint.h>

    #define B    37U
    #define Binv 290200493U   /* multiplicative inverse of 37 mod 2^31 */

    #define HASHDATA_EMPTY { 0U, 1U }

    struct hash_data_t {
        uint32_t value;
        uint32_t Bipow;
    };

    #define hash_add_right(hdata, c) \
        (hdata.Bipow *= Binv, hdata.value = (hdata.value * Binv + (unsigned char)c))
    #define hash_remove_left(hdata, c) \
        (hdata.Bipow *= B, hdata.value = (hdata.value - hdata.Bipow * (unsigned char)c))
    #define hash_mod_2k(hdata, numbits) \
        (((hdata.value >> (32 - (numbits))) ^ hdata.value) & ((1U << (numbits)) - 1U))
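
    A short usage sketch (illustrative, not from the header) of the queue-like interface, relying on the definitions above:

    /* Roll a window over "abcd" so it ends up covering "bcd",
       then fold the state to a table index. */
    static uint32_t example(void)
    {
        struct hash_data_t h = HASHDATA_EMPTY;
        hash_add_right(h, 'a');
        hash_add_right(h, 'b');
        hash_add_right(h, 'c');
        hash_add_right(h, 'd');
        hash_remove_left(h, 'a');      /* drop the oldest byte */
        return hash_mod_2k(h, 16);     /* 16-bit hash-table index */
    }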


    Adding bytes uses the simple formula:

    hashvalue = hashvalue * B + c

    c is the input byte, and B was originally the number 37. One of my innovations was to replace 37 with its multiplicative inverse (mod 2^31), which is probably also prime, but I haven't checked.

    I removed the randomization part, which substituted each byte with a random integer the size of the hash. I think the random numbers were counterproductive: by making the low bits of the hash random, they fairly well guarantee that some patterns will repeat, and since those bits receive little or no mixing, that makes them a source of collisions. The random numbers contribute no entropy to the hash beyond what was in the original bytes. I think the only real purpose they served was to spread the effect of each input byte across the entire hash faster than it could spread through repeated multiplication by 37. That's not much of an issue once the hash gets going, but after removing them I had collisions when hashing short words, where the hash was almost completely zero after adding one or two bytes.

    However, I had the idea to try the multiplicative inverse of 37 instead of 37 itself. A modular inverse is the modular equivalent of a regular multiplicative inverse, and it is uniquely determined for any number coprime to the modulus. That's the number Binv in the code. Just by looking at this number in binary, you can see it has many more 1 bits than 37 does, and it is much larger relative to the size of the hash. The 1 bits all become shift-and-add operations in a binary multiplication, so lots of 1s imply rapid mixing (though too many will cancel out). The calculators in both Windows and OS X have programmer modes that show the current number in binary, and it's a simple and educational exercise to watch what happens in this hash under repeated multiplication by a prime.

    Experimentally, replacing 37 with its inverse seemed to resolve the lack of mixing caused by removing the random numbers, and in some cases it reduced collisions. I didn't look too deeply into the significance of the number 37, which was built in to Lemire's implementation, but I have a feeling that 37 and its inverse share common properties, and exchanging them is a mathematically reasonable thing to do.
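
    Such an inverse is cheap to compute, by the way. Here is a small sketch (illustrative, not from the original post) using the standard Newton-Hensel iteration for inverses modulo a power of two:

    #include <stdint.h>
    #include <stdio.h>

    /* Inverse of an odd number a modulo 2^k (k <= 32). Each iteration
       doubles the number of correct low bits. */
    static uint32_t inv_mod_pow2(uint32_t a, unsigned k)
    {
        uint32_t x = a;          /* odd a satisfies a*a == 1 (mod 8): 3 correct bits */
        for (int i = 0; i < 5; i++)
            x *= 2u - a * x;     /* 3 -> 6 -> 12 -> 24 -> 48 correct bits */
        return (k < 32) ? (x & ((1u << k) - 1u)) : x;
    }

    int main(void)
    {
        printf("%u\n", inv_mod_pow2(37u, 31));  /* prints 290200493, the Binv above */
        return 0;
    }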

    I'm not that experienced at testing hash functions, but this one seems pretty good, and it's simple.


  #7 Bulat Ziganshin:
    nburns, this sort of rolling hash is well known and has been in use at least since the '90s. Many of us here at encode.ru use the prime 123456791. I also posted a simple program that finds prime numbers just for this sort of hash.

  #8 nburns:
    Quote Originally Posted by Bulat Ziganshin:
    nburns, this sort of rolling hash is well known and has been in use at least since the '90s. Many of us here at encode.ru use the prime 123456791. I also posted a simple program that finds prime numbers just for this sort of hash.
    It's Karp-Rabin, as I said, and it's derived from Daniel Lemire's. The major thing I've done is to make it expandable, which I haven't seen done by anyone else. A standard rolling hash has a fixed window, but this one is more like a queue that holds any number of elements. The idea is to apply hashing to a broader range of problems, and the experiment is ongoing.

    I've also made some modifications, like entirely removing the randomized part, and I've experimented with using the modular multiplicative inverse of a prime, rather than choosing a small prime or one that just looked good without mathematical justification. Using the inverse of 37 looks remarkably like dividing rather than multiplying, in terms of how it affects the bits, which is a curious thing. This function was designed by Karp and Rabin as a randomized hash function that depended on random numbers, but now it's deterministic. And it still seems to work, actually, pretty well.

    I brought it up mainly for the sake of discussing hashing, and it's a platform for experimentation more than anything else.

    I think I read your message where you posted the prime-number finder. I think that was a different kind of Rabin-Karp, and I have a sense that there are many Rabin-Karps. This one doesn't have a requirement to find prime numbers: the modulus is any power of 2, and it came hard-coded with a small prime that's suitable for any modulus.

  #9 Jean-Marie Barone:
    New hashes are always welcome! I'm sure this rolling hash can prove useful for someone, somewhere.
    For the moment, I'm still happy with my weird-looking hash.

  #10 Matt Mahoney:
    Wolfram Alpha says that the inverse of 37 mod 2^31 is composite. Not sure that it matters.
    http://www.wolframalpha.com/input/?i...+mod+2%5E31%29

  #11 nburns:
    Quote Originally Posted by Matt Mahoney:
    Wolfram Alpha says that the inverse of 37 mod 2^31 is composite. Not sure that it matters.
    http://www.wolframalpha.com/input/?i...+mod+2%5E31%29
    Interesting. There could be some subtle consequence, but mostly what matters is that it's coprime with 2. I used Alpha to calculate it to begin with.

  #12 Bulat Ziganshin:
    Quote Originally Posted by nburns:
    The major thing I've done is to make it expandable, which I haven't seen done by anyone else. A standard rolling hash has a fixed window, but this one is more like a queue that holds any number of elements.
    Now I get it! I think it may provide better results than zpaq's idea of multiplying by an even number. We should run some tests.

  #13 Matt Mahoney:
    I guess it depends on the rules for changing the window size. To find dedupe fragment boundaries, zpaq uses an order-1 predictor and computes a rolling hash that depends on the last 32 mispredicted bytes plus any bytes in between. To do this it uses a 32-bit hash and multiplies by an even number that is not a multiple of 4 when a byte is mispredicted, and by an odd number when the byte is predicted. The actual code is like:
    Code:
      if (o1[c1]==c) h=(h+c+1)*314159265u;  // predicted: multiply by an odd number
      else h=(h+c+1)*271828182u;            // mispredicted: multiply by an even number (2 mod 4)
      o1[c1]=c;
      c1=c;
    where c is the current byte, c1 is the previous byte, h is the 32-bit hash, and o1[256] is the order-1 prediction table.
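
    As a rough illustration of how that snippet drives chunking, here is a self-contained sketch; the boundary test (top 16 bits of h all zero, for 64 KiB average fragments) and the function name find_boundary are illustrative choices, not zpaq's exact rule.

    #include <stdint.h>
    #include <stddef.h>

    /* Scan buf and return the position just past the first fragment
       boundary, or len if no boundary is found. */
    static size_t find_boundary(const unsigned char *buf, size_t len)
    {
        uint32_t h = 0;
        unsigned char o1[256] = {0};   /* order-1 prediction table */
        unsigned char c1 = 0;          /* previous byte */
        for (size_t i = 0; i < len; i++) {
            unsigned char c = buf[i];
            if (o1[c1] == c) h = (h + c + 1) * 314159265u;  /* predicted */
            else             h = (h + c + 1) * 271828182u;  /* mispredicted */
            o1[c1] = c;
            c1 = c;
            if (h < (1u << 16))   /* high bits all zero: declare a boundary */
                return i + 1;
        }
        return len;
    }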

  #14 Bulat Ziganshin:
    I know this code very well (and employ it in srep's -m1 mode). Its drawback, obviously, is that older bytes affect only a few of the highest bits of h. So nburns' idea may provide us with better hashing.

  #15 Matt Mahoney:
    Yes, that is a problem, or a feature, depending on what you do with it. The low n bits of the hash depend only on a window of the last n mispredicted bytes. That's why zpaq makes its fragment-boundary decisions based on the high bits of h.

  #16 Bulat Ziganshin:
    I think it makes the decision worse, because, for example, the oldest byte in the window can change only one bit of the hash. That would be a problem for regular hashing (e.g. in a match finder), so I expect it reduces the quality of the hashing (and therefore of the chunking) in zpaq too.

  #17 nburns:
    Quote Originally Posted by Bulat Ziganshin:
    I think it makes the decision worse, because, for example, the oldest byte in the window can change only one bit of the hash. That would be a problem for regular hashing (e.g. in a match finder), so I expect it reduces the quality of the hashing (and therefore of the chunking) in zpaq too.
    I think Matt designed it that way on purpose. It seemed like a questionable thing to do, but I haven't looked into the problem and its solution enough to really judge.

    It's been a while since I looked at this hash, so I'm not sure whether this version was the one that passed all the SMHasher tests. The final version should be around somewhere, e.g. on Google Code.

    P.S. When I wrote that post, I had a misunderstanding related to composite finite fields. I'm doing normal integer arithmetic there, mod 2^31. That's not the same as the Galois field with 2^31 elements. However, the math still worked out.

    P.P.S. I think I developed the hash further in this thread: http://encode.ru/threads/1747-Extremely-fast-hash

    There may be a better version there.

  #18 Bulat Ziganshin:
    The standard Rabin-Karp scheme involves a rolling hash of a fixed number of trailing bytes, say the last 48. zpaq's hashing covers a variable number of bytes, including the last 32 mispredicted ones. The idea, as I understand it, is to include enough "entropy" (since predicted bytes have low entropy). zpaq multiplies the hash by an even number in order to suppress the influence of the bytes preceding those 32 mispredicted ones, but a side effect is that, among the bytes that still influence the hash, the older ones also have less influence. Maybe that is a desirable effect, I don't know; but if not, your multiplication scheme would improve the results. Well, it's easier to check than to discuss.

  #19 Bulat Ziganshin:
    At http://encode.ru/threads/1747-Extremely-fast-hash you tried to convince us that it is a good general-purpose hash, but a rolling hash obviously can't be good for anything else.

  #20 nburns:
    Quote Originally Posted by Bulat Ziganshin:
    The standard Rabin-Karp scheme involves a rolling hash of a fixed number of trailing bytes, say the last 48. zpaq's hashing covers a variable number of bytes, including the last 32 mispredicted ones. The idea, as I understand it, is to include enough "entropy" (since predicted bytes have low entropy). zpaq multiplies the hash by an even number in order to suppress the influence of the bytes preceding those 32 mispredicted ones, but a side effect is that, among the bytes that still influence the hash, the older ones also have less influence. Maybe that is a desirable effect, I don't know; but if not, your multiplication scheme would improve the results. Well, it's easier to check than to discuss.
    If you are working mod 2^k and you multiply by 2, the whole thing shifts left and you lose one bit from the top. The same applies if you multiply by any x = 2a: it's a multiply by 2 followed by a multiply by a, so at least one top bit is lost each time, and after 32 even multiplies nothing of the old state remains, since 2^32 ≡ 0 (mod 2^32). ... I guess the purpose of Matt's scheme is to let the older symbols fade out gradually, rather than removing them all at once. It would be interesting to see evidence for doing that. But it seems easier to envision the results when you just hash the last k symbols, without trying to weight them.

  #21 nburns:
    Quote Originally Posted by Bulat Ziganshin:
    At http://encode.ru/threads/1747-Extremely-fast-hash you tried to convince us that it is a good general-purpose hash, but a rolling hash obviously can't be good for anything else.
    Why is that obvious? I'd counter that there's nothing obvious preventing you from making a good hash that also rolls.

    Empirically, the collision rates matched the ideal on all the data I fed it. It also passed all the tests in SMHasher.

