
Thread: Essence of deduplication (CDC)

  1. #1
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Essence of deduplication (CDC)

    I am analyzing https://github.com/baixiangcpp/FileCDC
    It taught me that each found block ends where the last bits of the fingerprint equal some fixed value (nice trick!). In this example the last 13 bits are checked and the minimal block is about 8 KiB, but blocks can be larger, because those 13 bits do not always hit the value right away:
    Code:
            while (cur < tail) {
                fingerprint = (cur == block_min_sz - 1) ?
                    finger(buf + cur - BLOCK_WIN_SZ + 1, BLOCK_WIN_SZ) :
                    rolling_finger (fingerprint, BLOCK_WIN_SZ, 
                                    *(buf+cur-BLOCK_WIN_SZ), *(buf + cur));
    with BLOCK_WIN_SZ = 48.

    I don't understand why, in the Gear hash computation, incoming bytes are added to the hash but bytes leaving the window are never subtracted.
    Last edited by AndrzejB; 3rd July 2019 at 07:38.

  2. #2
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Gear does not work:
    Code:
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <cinttypes>
    #include <random>

    uint64_t gear_table[256];

    int main()
    {
        std::mt19937_64 rng(0);
        for (int i = 0; i < 256; i++)
            gear_table[i] = rng();

        int length = 2048;
        unsigned char* buffer = new unsigned char[length];
        memset(buffer, 0, length);

        uint64_t gear = 0;
        for (int i = 0; i < length; i++)
        {
            gear = (gear << 1) + gear_table[buffer[i]];
            printf("%" PRIx64 "\n", gear);  // was %llx, not portable for uint64_t
        }
        delete[] buffer;                    // was: delete buffer
    }
    More and more of the least significant bits become set and stop changing; finally the value is d717c83a34be23c2 and never changes again.

    --->https://github.com/lemire/rollinghashcpp
    Last edited by AndrzejB; 3rd July 2019 at 11:10. Reason: way to solution

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    The FileCDC code looks overcomplicated. The overall CDC idea is to split data into chunks at the points where hash & 0xFFF == 0 and then find duplicates among these chunks:

    Code:
    hash_table ht
    chunk_start = 0
    uint32 hash = 0
    
    for pos in BLOCK_WIN_SZ .. length:
      hash = update(hash,pos)
      if hash & 0xFFF == 0:
        chunk_hash = hash2(chunk_start, pos)
        if (old_chunk = ht[chunk_hash]) isn't null:
          -- Block at old_chunk is a possible duplicate of the [chunk_start, pos) block
          ...
        ht[chunk_hash] = chunk_start = pos
    With the hash & 0xFFF == 0 check, the AVERAGE chunk size is 4 KB, but an individual chunk can be anything from 1 byte to infinity.


    Now, the update function. We can use either a CRC or a polynomial hash. Polynomial hash:

    hash(pos) = (....(buf[pos-BLOCK_WIN_SZ] * K + buf[pos-BLOCK_WIN_SZ+1]) * K + ....) * K + buf[pos-1]

    You need to compute K2 = pow(K, BLOCK_WIN_SZ) first, and then

    update(hash, pos) ::= hash*K - buf[pos-1-BLOCK_WIN_SZ]*K2 + buf[pos-1]

    (the incoming hash here is hash(pos-1), so the outgoing byte is buf[pos-1-BLOCK_WIN_SZ]).
    Last edited by Bulat Ziganshin; 3rd July 2019 at 21:13.


  5. #4
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    update(hash, pos) ::= hash*K - buf[pos-1-BLOCK_WIN_SZ]*K2 + buf[pos-1]
    Previously I thought that the hash depends on all the chunk's bytes, from kilobytes back up to pos.
    But it depends only on the last BLOCK_WIN_SZ bytes (often 48).
    Does that mean we compute two hashes: one Rabin fingerprint of the last 48 bytes, and another (for example SHA) of the whole chunk?
    If the fingerprint depends on only 48 bytes, this may be a problem: if we have 1000 zero bytes, all the fingerprints will be identical, so either hash & 0xFFF == 0 never holds or it always holds.

  6. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Here I have a simple CDC-based dedup utility: http://nishi.dreamhosters.com/u/fmarep_v0.rar

    Also here's an even simpler remote diff based on same idea: http://nishi.dreamhosters.com/u/fma-diff_v0.rar

    Main differences:
    1) I use a rolling version of CRC32, which is a better hash
    2) hash < threshold is used for the fragment boundary, rather than (hash & bits) == 0; the former is more flexible

  7. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Does it mean that we compute two hashes: one Rabin fingerprint of the last 48 bytes and another (for example SHA) of the chunk?
    Yes, that's my fault. You need to compute a hash of the entire chunk. There are two approaches for that:
    1) compute a cryptographic hash, to ensure that equal hashes mean equal chunks
    2) compute an ordinary hash, but keep the data in memory and use memcmp for the final decision. This way you can also find byte-exact match boundaries

    If the fingerprint depends on only 48 bytes, this may be a problem: with 1000 zero bytes all the fingerprints will be identical
    In order to deal with long zero runs, we can break from the cycle when `hash > BOUNDARY || chunk_len > MAX_CHUNK_LEN`

    I use a rolling version of crc32 which is a better hash
    AndrzejB, look at this text to learn more about the fascinating math behind it

  8. #7
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Before reading this text, a short question: I read mattmahoney.net/dc/dce.html#Section_527.
    It is a very simple algorithm:
    Code:
    uint32_t h = 0;           // rolling hash value
    int c;                    // input byte
    const int K = 876543210;  // a random 32 bit even number not a multiple of 4
    while ((c = getchar()) != EOF) {
        h = (h + c + 1) * K;
        if (h >= 0xfffc0000) mark_boundary();
    }
    But where is the "window of 32 bytes"? Is it 32 bytes because the const K, as a polynomial, has degree 32?
    And without an explicit sliding window, is it not possible to get very short chunks? I have heard about the "sample byte" approach: https://www.usenix.org/legacy/event/...s/aggarwal.pdf

  9. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    The K value is even, so one bit of context is shifted out on every data byte.
    The hash is 32-bit, so there's no dependency after 32 bytes.

  10. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    But where is "window of 32 bytes"?
    Like everything else, there are slightly different versions of the rolling hash formula.

    The one we described here is the most popular. In DCE, Matt tried to minimize the sample code by employing a polynomial hash with K divisible by 2 but not by 4. Such a K ensures that on each cycle the hash value is shifted one bit left, so bytes older than 32 positions are completely shifted out of the hash value.

    In zpaq he used an improved version of this trick, giving the best CDC rolling hash formula that we know.


