Thread: Filling a large Zobrist hash table

  1. #1
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Filling a large Zobrist hash table

    Hello all,
    For one program I am developing I need a large Zobrist hashing table (2^20 elements).

    The problem I face is that all the examples I can find assume smaller tables:
    Chess: 13 values * 64 positions = 832 elements.
    Go: 3 values * 361 positions = 1,083 elements.

    But my program is not a game and needs 256 values * 4,096 positions = 1,048,576 elements.

    The problem with filling the table is that most schemes just pick random values for the elements. With 32-bit values there is very little chance of bad elements being chosen in the examples above, but I will be generating 2^20 elements and I have my doubts about conflicting values being used.

    So far I have two rules to follow when I fill the table.

    1) No element in the hash table can have the value of zero.

    2) No element value can exist in more than one entry in the hash table.

    Can anyone tell me if there are any other rules I need to follow?
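
    For illustration, here is a minimal sketch of a fill loop that enforces the two rules above. It assumes the 256 values * 4,096 positions table of 32-bit entries described earlier; the xorshift PRNG and the open-addressed "seen" set are placeholders for whatever the real code would use, not the actual driver code.

    Code:
    #include <stdint.h>
    #include <string.h>

    #define VALUES     256
    #define POSITIONS  4096
    #define ENTRIES    (VALUES * POSITIONS)   /* 2^20 table entries */
    #define SEEN_SLOTS (ENTRIES * 4)          /* open-addressing slots, 25% max load */

    static uint32_t table[VALUES][POSITIONS];
    static uint32_t seen[SEEN_SLOTS];         /* 0 marks an empty slot; 0 is never stored */

    /* Placeholder 32-bit PRNG (xorshift32); any decent generator would do. */
    static uint32_t prng_state = 0x12345678u;
    static uint32_t next32(void)
    {
        uint32_t x = prng_state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        return prng_state = x;
    }

    /* Returns 1 if v was seen before, otherwise records it and returns 0. */
    static int seen_before(uint32_t v)
    {
        uint32_t i = (v * 2654435761u) % SEEN_SLOTS;  /* multiplicative hash */
        while (seen[i] != 0) {
            if (seen[i] == v)
                return 1;
            i = (i + 1) % SEEN_SLOTS;
        }
        seen[i] = v;
        return 0;
    }

    static void fill_table(void)
    {
        memset(seen, 0, sizeof seen);
        for (int v = 0; v < VALUES; v++) {
            for (int p = 0; p < POSITIONS; p++) {
                uint32_t r;
                do {
                    r = next32();
                } while (r == 0 || seen_before(r));   /* rule 1, then rule 2 */
                table[v][p] = r;
            }
        }
    }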

  2. #2
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    What application is this for? The only reason for using a Zobrist hash is when you have a lot of similar inputs where you can quickly compute a list of hashes from a list of edits.

    All hashes have collisions. You can't do much better than just picking random values for your tables unless your data has some special structure.
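
    To make the "list of edits" property concrete, a short sketch, assuming a table[value][position] layout like the one described in the first post: once a block's hash is known, changing a single byte only costs two XORs.

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* h is the current Zobrist hash of the block; changing the byte at
       position pos from old_byte to new_byte only needs two XORs. */
    uint32_t update_hash(uint32_t h, const uint32_t table[256][4096],
                         size_t pos, uint8_t old_byte, uint8_t new_byte)
    {
        h ^= table[old_byte][pos];   /* remove the old byte's contribution */
        h ^= table[new_byte][pos];   /* add the new byte's contribution    */
        return h;
    }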

  3. #3
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Data Deduplication

    Quote Originally Posted by Matt Mahoney View Post
    What application is this for? The only reason for using a Zobrist hash is when you have a lot of similar inputs where you can quickly compute a list of hashes from a list of edits.

    All hashes have collisions. You can't do much better than just picking random values for your tables unless your data has some special structure.
    I am writing a block-device driver for Haiku-OS that includes data deduplication for each 4K block of data.

    For this I want a fast hashing routine with minimal code.

    This is a work in progress, so I will consider and test other hashing methods but this looks like a good one to use at present.

    Note 1: I have my own reasons for using 32-bit hashes. This is not an enterprise product and does not need 64- to 256-bit hashes to be useful.

    Note 2: If this sounds like something I was talking about two years ago, you are right. My mother had been very sick since January 2012 and passed away in March 2013. I was too busy taking care of her to work on the code at that time. I am now starting from the beginning again in order to review all the old work first.

    Note 3: Yes, I know some hashing methods read the data 32/64 bits at a time and will hash faster, but since I do not want any machine code I will need to test and time the C versions to see whether they are a better choice. At the moment, though, I want to learn more about Zobrist hashing, as it looks very useful for a number of purposes.
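
    For reference, a sketch of how one 4 KB block might be hashed with the 256 x 4096 table from the first post; the function name is made up for illustration and this is only an assumed shape of the routine, not the actual driver code.

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Zobrist hash of one 4096-byte block: XOR together one table entry
       per (byte value, byte position) pair. */
    uint32_t zobrist_block_hash(const uint8_t block[4096],
                                const uint32_t table[256][4096])
    {
        uint32_t h = 0;
        for (size_t pos = 0; pos < 4096; pos++)
            h ^= table[block[pos]][pos];
        return h;
    }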

    Thanks for any help given.

  4. #4
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    To answer your question, one needs to know more about how you plan to use it. Overall, Matt is right: there is a small chance of duplicate hashes, so you should be ready to deal with them. It's the nature of ANY hashing.
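
    For a rough sense of scale, assuming an ideal uniform 32-bit hash: among n distinct blocks the expected number of colliding pairs is about n*(n-1)/2 divided by 2^32, so 2^20 unique 4 KB blocks (roughly 4 GB of unique data) already gives about 2^39 / 2^32 = 128 expected collisions. However the table is filled, those have to be handled, for example by a byte-for-byte compare before deduplicating.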

  5. #5
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    40
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I am not sure what else you need to know. I thought I made it clear in my last message.

    I wrote some software to test results of different ways of filling the table. Just using random numbers is not good enough once you start using large tables.

    First, there is the chance that more than one element in the table will have the same value, which of course increases the probability of collisions. The code that fills the table must avoid that condition.

    Second, an element with a value of zero does nothing to change the hash value, so no element can have that value.

    Since I last posted, I have found that the values entered into the table, whether chosen purely at random or by a simple linear filler, do not have equal probabilities in the number of bits flipped. So I am rewriting my hash-filler code to even out the probabilities.

    Okay, maybe I am not being so clear. I think I need to write it up and post my goals a little later.
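
    As an illustration of the bit-balance check mentioned above (an assumed measurement, not the poster's actual test code): each table entry flips exactly popcount(entry) hash bits when it is XORed in, so tallying the popcounts of all 2^20 entries shows how far the fill is from the binomial shape that ideal uniform values would give.

    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Histogram of set-bit counts over all table entries.  For ideally
       uniform 32-bit values the counts follow a binomial(32, 1/2)
       distribution centred on 16 bits. */
    void popcount_histogram(const uint32_t table[256][4096])
    {
        unsigned long hist[33] = {0};
        for (int v = 0; v < 256; v++)
            for (int p = 0; p < 4096; p++) {
                uint32_t x = table[v][p];
                int bits = 0;
                while (x) {          /* Kernighan's bit-counting loop */
                    x &= x - 1;
                    bits++;
                }
                hist[bits]++;
            }
        for (int b = 0; b <= 32; b++)
            printf("%2d bits set: %lu entries\n", b, hist[b]);
    }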

  6. #6
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Well, what's your code space and what's the hash size? If the code space is larger, then you will have some duplicates anyway. By ensuring that there are no zeroes among the basis hashes, you just move the duplicates to other positions; for example, it increases the probability that hash(bit1+bit2) is zero.
