
Thread: Deduplication X-Files

  1. #1
    Bulat Ziganshin (Programmer)

    Deduplication X-Files

    Programs:

    Documentation:

    Rolling hashes:

    Strong online hashes (may be stored inside updatable archive):
    • Blake2: 3.3-5.5 cpb
    • BMW
    • SHA-1: 10 cpb

    Strong offline hashes (should not be used for further updates):
    Last edited by Bulat Ziganshin; 20th May 2013 at 21:32.

  2. #2
    Bulat Ziganshin (Programmer)
    Being the only closed-source algorithm in the group, EXDUPE is the most intriguing one. So I've looked into an exdupe-compressed file and found that it has text-labeled sections; it probably saves the hashes of deduplication chunks in the HASHTABL section. For a 100 MB input file, there is a little less than one 49-byte entry per 8 KB of input data. This leads me to the conclusion that EXDUPE implements content-defined chunking with an 8 KB average chunk size.
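
    To illustrate the general technique, here is a minimal content-defined chunking sketch using a gear-style rolling hash. It is only a guess at the kind of scheme involved - EXDUPE's actual hash function, mask width and chunk-size limits are unknown to me; the 13-bit mask is chosen simply because it gives the ~8 KB average observed above, and the minimum/maximum chunk sizes are assumptions. Call init_gear() once before chunking.

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Gear-style rolling hash table, filled once with pseudo-random 64-bit values.
    static uint64_t gear[256];

    static void init_gear()
    {
        uint64_t x = 0x0123456789ABCDEFULL;              // arbitrary fixed seed
        for (int i = 0; i < 256; ++i) {                  // splitmix64-style mixing
            x += 0x9E3779B97F4A7C15ULL;
            uint64_t z = x;
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
            gear[i] = z ^ (z >> 31);
        }
    }

    // Returns the end offsets of the chunks: a cut is made wherever the low
    // 13 bits of the rolling hash are zero, giving an ~8 KB average chunk size.
    static std::vector<size_t> chunk_boundaries(const uint8_t *data, size_t size)
    {
        const uint64_t MASK      = (1u << 13) - 1;       // average chunk ~8 KB
        const size_t   MIN_CHUNK = 2 * 1024;             // assumed lower bound
        const size_t   MAX_CHUNK = 64 * 1024;            // assumed upper bound

        std::vector<size_t> cuts;
        uint64_t h = 0;
        size_t chunk_start = 0;
        for (size_t i = 0; i < size; ++i) {
            h = (h << 1) + gear[data[i]];                // depends on the last ~13 bytes
            size_t len = i - chunk_start + 1;
            if ((len >= MIN_CHUNK && (h & MASK) == 0) || len >= MAX_CHUNK) {
                cuts.push_back(i + 1);                   // chunk ends after byte i
                chunk_start = i + 1;
                h = 0;
            }
        }
        if (chunk_start < size)
            cuts.push_back(size);                        // trailing partial chunk
        return cuts;
    }

    Each resulting chunk would then be hashed with a strong hash and looked up among the already-stored chunks; the 49-byte HASHTABL entries could plausibly hold such a digest plus a chunk length or offset.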

    I've compared EXDUPE 0.3.6 with srep4 using an 8 KB chunk size (note that my HDD needs 200-220 seconds to read the vhd file):

    C:\>timer exdupe.exe -o -x0 -t12 -v5 z:\vhd nul
    COMPRESSED 98,420,528,640 bytes in 1 file(s) into 74,645,544,571 bytes
    Kernel Time = 29.780 = 00:00:29.780 = 14%
    User Time = 849.144 = 00:14:09.144 = 408%
    Process Time = 878.925 = 00:14:38.925 = 422%
    Global Time = 208.059 = 00:03:28.059 = 100%

    C:\>timer exdupe.exe -o -x0 -t2 -v5 z:\vhd nul
    COMPRESSED 98,420,528,640 bytes in 1 file(s) into 74,385,508,529 bytes

    C:\>timer exdupe.exe -o -x0 -t1 -v5 z:\vhd nul
    COMPRESSED 98,420,528,640 bytes in 1 file(s) into 73,577,707,185 bytes (using 1200 mbytes of RAM)
    Kernel Time = 20.779 = 00:00:20.779 = 3%
    User Time = 653.347 = 00:10:53.347 = 100%
    Process Time = 674.127 = 00:11:14.127 = 104%
    Global Time = 647.716 = 00:10:47.716 = 100%

    C:\>for %a in (1 2 3) do @srep64g.exe -m%ao -v -l8k -nomd5 z:\vhd nul
    fixed-window content-defined chunking: 73,342,817,924. Cpu 1129 mb/s (83.149 sec), real 456 mb/s (205.697 sec) = 40%
    order-1 content-defined chunking: 70,747,488,718. Cpu 568 mb/s (165.345 sec), real 465 mb/s (201.881 sec) = 82%
    fixed 8kb chunks: 66,713,317,096. Cpu 100 mb/s (939.781 sec), real 132 mb/s (711.797 sec) = 132%


    So, exdupe may improve its compression ratio by incorporating zpaq's chunking, and improve its speed by using faster SHA-1/SHA-2 implementations (about 200 seconds of CPU time for the vhd file) or BMW/Blake2 hashes (less than 100 seconds). After all optimizations, it should be able to reach a zpaq-style compression ratio using 150-200 CPU seconds.
    Last edited by Bulat Ziganshin; 20th May 2013 at 19:42.

  3. #3
    Bulat Ziganshin (Programmer)
    FMA secrets (to be written...)

  4. #4
    Bulat Ziganshin (Programmer)
    REP secrets (to be written...)

  5. #5
    Bulat Ziganshin (Programmer)
    SREP secrets (to be written...)

  6. #6
    Matt Mahoney (Expert)
    How is it that fixed 8 KB chunks give the best compression and the worst speed?

  7. #7
    Bulat Ziganshin (Programmer)
    The source chunks are fixed, while the destination addresses are arbitrary. That is, srep splits the data into fixed-size 8 KB blocks and then checks every position in the file for a match with these blocks. That's why it's slow: virtually, it performs one memory access per byte of input data, whereas with content-defined chunking you perform a single memory access per 8 KB of data.
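
    A rough sketch of how I understand this scheme (my own reconstruction and naming, not srep's code): hash every L-aligned block of the data into a table, then probe that table at every byte position with a rolling hash of the last L bytes - which is exactly the one-lookup-per-byte cost described above.

    #include <cstdint>
    #include <cstddef>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    const size_t   L = 8 * 1024;                     // -l8k block size
    const uint64_t C = 0x100000001B3ULL;             // odd multiplier (the 64-bit FNV prime)

    static uint64_t pow_C(size_t n)                  // C^n modulo 2^64
    {
        uint64_t r = 1, b = C;
        for (; n; n >>= 1, b *= b)
            if (n & 1) r *= b;
        return r;
    }

    // Returns (position, block index) pairs where the L-byte window starting at
    // 'position' has the same hash as some fixed, L-aligned source block.
    static std::vector<std::pair<size_t, size_t>>
    find_block_matches(const uint8_t *data, size_t size)
    {
        // 1. Hash every fixed, L-aligned source block into a table.
        std::unordered_map<uint64_t, size_t> blocks;
        for (size_t b = 0; b + L <= size; b += L) {
            uint64_t h = 0;
            for (size_t i = 0; i < L; ++i) h = h * C + data[b + i];
            blocks.emplace(h, b / L);                // keeps the first block on hash collision
        }

        // 2. Slide an L-byte window over the data: one table lookup per input byte.
        std::vector<std::pair<size_t, size_t>> candidates;
        const uint64_t C_L = pow_C(L);               // weight of the byte leaving the window
        uint64_t h = 0;
        for (size_t i = 0; i < size; ++i) {
            h = h * C + data[i];                     // add the incoming byte
            if (i >= L) h -= C_L * data[i - L];      // drop the outgoing byte
            if (i + 1 < L) continue;                 // window not yet full
            auto it = blocks.find(h);
            if (it != blocks.end())
                candidates.emplace_back(i + 1 - L, it->second);
        }
        return candidates;
    }

    A real implementation would of course verify each candidate byte-by-byte (hashes can collide) and would only reference blocks that occur earlier in the stream.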

    PS: I'm open to better names for this scheme. The other schemes are already renamed to "fixed-window content-defined chunking" and "order-1 content-defined chunking" - maybe there are better names for those too?
    Last edited by Bulat Ziganshin; 20th May 2013 at 19:44.

  8. #8
    m^2 (Member)
    ZFS?
    LessFS?
    OpenDedupe (was that the name)?

  9. #9
    Matt Mahoney (Expert)
    Long range LZ77?

  10. #10
    Bulat Ziganshin (Programmer)
    What do you mean, Matt?

    m^2, I studied the lessfs sources only to find that it performs nothing more than trivial full-block deduplication.

    Overall, this topic reflects my interest in deduplication algorithms, so I plan to include the following:
    • deduplication programs, to benchmark against and to investigate their properties
    • deduplication algorithms (sources) to borrow from
    • papers describing various ideas applicable to deduplication


    Wikipedia lists a lot of software that uses deduplication as part of its process, but such software isn't of interest to me unless it comes with source code containing original ideas.

  11. #11
    Matt Mahoney (Expert)
    I mean srep uses LZ77 match codes anywhere in the file to refer to matches of length exactly 8K and offsets that are multiples of 8K from the start, and uncompressed literals, right? Or am I misunderstanding?

  12. #12
    Bulat Ziganshin (Programmer)
    srep encodes using classic LZ77 (i.e. matches and literals are strictly interleaved). In mode -m3 (formerly -m1) with the -l8k parameter, match lengths are N*8K and match source offsets are K*8K. In this mode each match+literal pair is encoded using three 32-bit words: match offset/L, match length/L and literal length (where L is the -l option value).
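
    A hedged sketch of that record layout (the struct and field names are mine, and the field order is an assumption, not taken from srep's source):

    #include <cstdint>
    #include <cstdio>

    const uint32_t L = 8 * 1024;            // value of the -l option (-l8k)

    // One match+literal pair of the -m3 stream: three 32-bit words.
    struct MatchLiteralPair {
        uint32_t match_offset_div_L;        // source offset of the match, in units of L
        uint32_t match_length_div_L;        // match length, in units of L
        uint32_t literal_length;            // raw bytes following the match, in bytes
    };

    int main()
    {
        // Example: copy 3*8K bytes from source offset 5*8K, then emit 100 literal bytes.
        MatchLiteralPair rec = { 5, 3, 100 };
        printf("copy %u bytes from offset %u, then %u literal bytes\n",
               (unsigned)(rec.match_length_div_L * L),
               (unsigned)(rec.match_offset_div_L * L),
               (unsigned)rec.literal_length);
        return 0;
    }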

  13. #13
    m^2 (Member)
    Most dedupe systems out there are online in the sense that they allow you to add and remove pieces of data as you please (not to be confused with 'inline', where you dedupe data on the way to storage as opposed to storing first and deduping later).
    It's a hard requirement for nearly all use cases. The only situation I can think of where a stronger but offline dedupe is superior is the transfer of a dataset containing duplicates, when all of the data is available before the transfer starts.
    Maybe I'm wrong, but I view srep as a product that fits a very tiny niche, without prospects of getting out of it.
    If that's what you want, fine. But unless you're content with that small niche, or unless you can somehow make srep online, don't dismiss such 'trivial' algorithms.

  14. #14
    Bulat Ziganshin (Programmer)
    me: lessfs is a trivial algorithm
    you: srep is a useless application

    Why do you think that I disagree with you? I'm just looking here at non-trivial deduplication algorithms, which are a large area of their own.

  15. #15
    m^2 (Member)
    Quote Originally Posted by Bulat Ziganshin:
    me: lessfs is a trivial algorithm
    you: srep is a useless application

    Why do you think that I disagree with you? I'm just looking here at non-trivial deduplication algorithms, which are a large area of their own.
    1. I haven't said that srep is useless, but that it's niche. That's an important distinction.
    2. I thought you disregarded block-level dedupe as trivial and not worth attention. If I was wrong, I apologise.

  16. #16
    Bulat Ziganshin (Programmer)
    It's a trivial algorithm. Maybe you don't realize that we are talking about different things - I look into algorithms, while you look into applications.

  17. #17
    m^2 (Member)
    I generalise. I don't know lessfs specifically, but I've read lots of articles about cool techniques used in blockwise dedupe algorithms to improve their performance, memory consumption and scalability.

  18. #18
    Bulat Ziganshin (Programmer)
    Just post those papers here. I have a lot of papers and will add them to the first post. That's the difference: mentioning lessfs in the first post is useless, because a reader would need to scan a lot of code just to find out that lessfs uses a trivial algorithm, whereas mentioning a concrete paper that describes a useful algorithmic technique would be great.
    Last edited by Bulat Ziganshin; 21st May 2013 at 21:03.

  19. #19
    m^2 (Member)
    FAST is great. Every year starting with 2010 there have been dedupe papers, and some in earlier years too:

    http://static.usenix.org/event/fast0.../zhu/zhu_html/
    http://static.usenix.org/events/fast10/tech/
    http://static.usenix.org/events/fast11/tech/
    http://static.usenix.org/events/fast12/tech/
    https://www.usenix.org/conference/fa...ast-13-program

    I see others from time to time and can link them when I stumble upon something cool. I may have something on my disk, but on Unix I lack the tools to search through it efficiently. I miss TC.


