
Thread: A fast diffing engine

  1. #1 by m^2 (Member)

    A fast diffing engine

    I need a tool to create diffs for the following purpose:
    - Binary data
    - Small, in-memory buffers with any number of differences, often no similarities at all

    Really, the only thing that matters is speed. Right now I use a modified bsdiff, but it's very slow: very roughly 1 MB/s on my machine after I replaced its bz2 compression with zlib -6, and almost all of that time is spent in the diffing algorithm. I don't really care about the size of the patches as long as it's something reasonable. Is there something better readily available? I'd strongly prefer something permissively licensed (i.e. not GPL), but it's not a strict requirement.
    Last edited by m^2; 12th September 2011 at 21:44.

  2. #2 by Surfer (Member)
    This may be interesting: http://encode.ru/threads/1164

  3. #3 by m^2 (Member)
    Thanks, but I've seen that already. The only alternative mentioned there is xdelta, which is GPLed, so I'd rather not use it for now...

  4. #4 by Shelwien (Administrator)
    Well, it'd be interesting if you could test this old thing -
    http://shelwien.googlepages.com/fma-diff_v0.rar
    http://nishi.dreamhosters.com/u/fma-diff_v0ss.rar (source)
    It's also a remote diff (you don't need both files on one side), it's not GPLed, and feedback would be helpful for me.

  5. #5 by m^2 (Member)
    Thanks.

  6. #6 by m^2 (Member)
    First comment:
    Those global variables make it much harder to understand and reuse.

  7. #7 by Shelwien (Administrator)
    It's just a proof-of-concept script. Since I first posted it (and still have it in the IRC topic), there has been no real feedback, so no progress.
    If you have specific requests, I might be able to implement them, but I likely won't do anything about style-related things.

  8. #8 by m^2 (Member)
    It's not about style, it's about thread safety.
    I already modified it to fix that issue, and now I'm contemplating how to turn it into a library. I need 2 interfaces, memory-to-memory and file-to-file. The simplest approach I see that doesn't duplicate code is replacing FILE* with a template parameter and implementing fread() etc. equivalents that work on memory buffers. Copying things around adds a fair overhead, but the amount of modification to the code is minimal and overall it's easy to do. Alternatively, I could rip it apart and add some ifs inside. I don't have the code in front of me, but when I looked at it yesterday, that option didn't look nice.
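    For illustration, a minimal sketch of the template approach described above (all names are hypothetical, not taken from fma-diff; C++11 assumed):
    Code:
    #include <cstdio>
    #include <cstring>
    #include <cstddef>

    // Hypothetical stream backends exposing the same read/write interface.
    struct FileStream {
      FILE* f;
      size_t read (void* p, size_t n)       { return fread (p, 1, n, f); }
      size_t write(const void* p, size_t n) { return fwrite(p, 1, n, f); }
    };

    struct MemStream {
      unsigned char* buf; size_t size; size_t pos = 0;
      size_t read(void* p, size_t n) {
        size_t k = n < size - pos ? n : size - pos;
        std::memcpy(p, buf + pos, k); pos += k; return k;
      }
      size_t write(const void* p, size_t n) {
        size_t k = n < size - pos ? n : size - pos;
        std::memcpy(buf + pos, p, k); pos += k; return k;
      }
    };

    // The diff core is written once against the stream types, so the same code
    // serves both the file-to-file and the memory-to-memory interface.
    template<class In, class Out>
    void copy_through(In& src, Out& dst) {  // stand-in for the real diff loop
      unsigned char tmp[4096];
      for (size_t n; (n = src.read(tmp, sizeof tmp)) != 0; ) dst.write(tmp, n);
    }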

  9. #9 by Shelwien (Administrator)
    Basically you only need the fma-hash.cpp/crc32.inc stuff. I guess I have a newer class somewhere, but that's kind of enough too.
    So you just have to understand the idea of "anchored hashing" (implemented in crc32.inc along with the rolling crc32),
    and then it's the same as a symbol diff, just with 32-bit (or larger) "symbols" (= fragment hashes).
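    To illustrate the idea, a rough sketch (not the fma code: a simple polynomial rolling hash stands in for the rolling crc32, and FNV-1a stands in for the fragment hash; window size and anchor mask are arbitrary):
    Code:
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Stand-in rolling hash over a fixed window (the real code rolls a crc32 from crc32.inc).
    struct RollingHash {
      static constexpr size_t   W = 48;            // window size (arbitrary choice)
      static constexpr uint64_t B = 0x100000001b3ULL;
      uint64_t h = 0, Bw = 1;
      std::vector<unsigned char> win;
      size_t pos = 0;
      RollingHash() : win(W, 0) { for (size_t i = 0; i < W; i++) Bw *= B; }
      void update(unsigned char c) {               // slide the window by one byte
        h = h * B + c - Bw * win[pos];
        win[pos] = c; pos = (pos + 1) % W;
      }
    };

    // Cut the buffer at "anchors" (positions where the rolling hash hits a sparse
    // condition) and hash each fragment. Two such hash sequences can then be
    // diffed like ordinary symbol sequences.
    std::vector<uint32_t> fragment_hashes(const unsigned char* p, size_t n,
                                          uint64_t anchor_mask = 0xFFF) {
      std::vector<uint32_t> out;
      RollingHash rh;
      uint32_t frag = 2166136261u;                 // FNV-1a offset basis
      for (size_t i = 0; i < n; i++) {
        rh.update(p[i]);
        frag = (frag ^ p[i]) * 16777619u;          // accumulate the fragment hash
        if ((rh.h & anchor_mask) == 0) {           // anchor: end the current fragment
          out.push_back(frag);
          frag = 2166136261u;
        }
      }
      out.push_back(frag);                         // tail fragment
      return out;
    }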

  10. #10 by m^2 (Member)
    It's kind of complicated, and I'd like to have this finished in about half a day; the deadline for my MSc thesis is just 2 weeks away and I can't afford to overcomplicate things.

    So unless I can make it work quickly, I'll use bsdiff for now and replace it at some unspecified time. Though I would really like to have something that works on big files, because the thing I'm doing is, compared to the competition (precomp), very fast on big inputs, and I'd like to put some flashy benchmark results in it.

  11. #11 by Shelwien (Administrator)
    Well, imho bsdiff is much harder to integrate than my tools. It's much more complicated and much less readable.
    As to precomp, that's fairly interesting, as I'm still stuck on deflate recompression myself.
    Can you tell what your "thing" is really able to do?

  12. #12 by m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    Well, imho bsdiff is much harder to integrate than my tools. Its much more complicated and much less readable.
    Well... no, at least not from my perspective. I'm not interested in hacking the algorithm, just in making it work for my case, and bsdiff is much simpler and more modifiable in this regard.

    Quote Originally Posted by Shelwien View Post
    As to precomp, that's fairly interesting, as I'm still stuck with deflate recompression myself.
    Can you tell what your "thing" is really able to do?
    Just precomp without the bells and whistles (like many supported formats, recursion, ...), but much faster. The main idea, trying to find out how the file was created, stays unchanged.
    Being much faster should allow integration of more compression engines and more in-depth searching, so it has the potential to be stronger too.

  13. #13 by Shelwien (Administrator)
    > bsdiff is much simpler and more modifiable in this regard.

    That sounds really strange to me... which bsdiff source do you use?

    > Just precomp w/out bells and whistles (like many supported formats,
    > recursion, ...) that's much faster. The main idea, trying to find
    > how was the file created, stays unchanged.

    Do you use zlib like schnaader does, or your own deflate implementation?
    What do you do about stream interleaving in output? (normal data/deflate data)
    I hope you don't create temp files like precomp?

    > Being much faster should allow for integration of more compression
    > engines and more in-depth searching, so it has potential to be stronger too.

    That's cool I guess.
    So what do you do for testing?

  14. #14 by m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    > bsdiff is much simpler and more modifiable in this regard.

    That's sounds really strange to me... which bsdiff source do you use?
    http://www.pokorra.de/coding/bsdiff.html

    Quote Originally Posted by Shelwien View Post
    > Just precomp w/out bells and whistles (like many supported formats,
    > recursion, ...) that's much faster. The main idea, trying to find
    > how was the file created, stays unchanged.

    Do you use zlib like schnaader does, or your own deflate implementation?
    Zlib. And I may add 7-zip and gzip too. Certainly not before October though.
    Quote Originally Posted by Shelwien View Post
    What do you do about stream interleaving in output? (normal data/deflate data)
    I ask users to stay away from such aberrations. The program only supports a subset of the gzip format as input, because I was interested in speeding up the search and nothing more, and this seemed to be the easiest format to implement.
    Quote Originally Posted by Shelwien View Post
    I hope you don't create temp files like precomp?
    I do, why?
    Just kidding.
    ADDED: Sadly, it's one of the bigger gains that I got.
    Quote Originally Posted by Shelwien View Post
    > Being much faster should allow for integration of more compression
    > engines and more in-depth searching, so it has potential to be stronger too.

    That's cool I guess.
    So what do you do for testing?
    Describing all the testing that I performed and intend to perform would be too much for a forum post, so I hope you'll be more specific.
    Last edited by m^2; 15th September 2011 at 10:21.

  15. #15 by m^2 (Member)
    Shelwien, I can't figure out how I'm supposed to use it. At first, I saw that fma-diff takes 3 files, opens the first 2 for reading and the last for writing, so I assumed the first 2 are the new and the old file. However, when I tried to use it this way, the output was practically a copy of the first file.
    So, looking at the comment about the second file ("// broken file's blockcrc") and at what you said, that I only need fma-hash, I tried fma-hash on the old file; it produced something big and binary. I tried to use it as an input for fma-diff, but the result was again a copy of the first file.
    So what's the right way of using it?

  16. #16 by Shelwien (Administrator)
    > http://www.pokorra.de/coding/bsdiff.html

    Thanks, but I don't see much difference from the original there.
    Well, in case you missed it, here's my version - http://nishi.dreamhosters.com/u/bsdiff_sh2.rar
    It's not integrated with bzip, but lzma is better for that task anyway.

    The only problem with bsdiff is that it's totally not a "fast diffing engine".
    Also, the size of files which bsdiff can handle is limited to ~210M.

    Maybe you should consider some LZ with large windows for diffing (lzma, rep/srep).

    > Zlib. And I may add 7-zip and gzip too. Certainly not before October though.

    gzip compression is exactly the same as zlib's. Or do you mean gzip headers?
    As to lzma, I think it's not a good idea - there are too many options to bruteforce,
    and what would you use to compress it back in the end? lzma again?

    >> What do you do about stream interleaving in output? (normal data/deflate data)

    > I ask users to stay away from such aberrations. The program only
    > supports a subset of gzip format as input because I was interested
    > in speeding up search and nothing more and this seemed to be the
    > easiest format to implement.

    Well, pdfs, pngs and such are kinda good targets for deflate recompression.

    But this question is still interesting to me, because I still haven't found
    a good solution. Suppose you're writing multiple logical streams to a single
    file, how do you separate them?
    1) Introduce a special marker (signature) and mask it in the data
    2) Store length prefixes for logical blocks (sketched below)
    3) Use a common entropy backend for all streams, where it's easy to add control symbols
    4) ???
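    For what it's worth, a bare-bones sketch of option 2; the [id][length][payload] framing is just an assumption for illustration, not an existing format:
    Code:
    #include <cstdio>
    #include <cstdint>
    #include <vector>

    // Each chunk is [1-byte stream id][32-bit length][payload]; a reader can route
    // chunks to per-stream consumers with no escaping or masking of the payload.
    static void put_chunk(FILE* f, uint8_t stream_id, const void* data, uint32_t len) {
      fputc(stream_id, f);
      fwrite(&len, sizeof len, 1, f);       // host byte order; fine for a sketch
      fwrite(data, 1, len, f);
    }

    static bool get_chunk(FILE* f, uint8_t& stream_id, std::vector<uint8_t>& data) {
      int id = fgetc(f);
      if (id == EOF) return false;
      uint32_t len = 0;
      if (fread(&len, sizeof len, 1, f) != 1) return false;
      stream_id = (uint8_t)id;
      data.resize(len);
      return len == 0 || fread(data.data(), 1, len, f) == len;
    }
    The cost of this option is that each logical block's length must be known (or the block buffered) before it is written, which is one reason the marker-plus-masking approach gets considered for truly streamed output.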

    >> So what do you do for testing?
    > All testing that I performed and intend to perform would be too much for a forum post
    > so I hope you'll be more specific.

    Obviously I don't mean specific files which you may try to recompress, but the types of tests.
    For example, this is the test with different deflate encoders:
    http://encode.ru/threads/1184-Precom...ll=1#post25541

    In other words, usually it's more efficient to prepare "exploits" to test a module,
    which are specially designed to trigger a bug. For example, if you work with
    32-bit variables, then it's quite likely that something might break on a >4G file.

    So the set of tests that your module passes is a good representation of its features,
    in a sense. That's why I asked, anyway.

    > So what's the right way of using it?

    There's a demo script in the binary archive (first link)...

    Anyway, fma-hash generates blockcrc/fingerprint files.
    (It's something like a torrent file with a small block size.)

    Then, the fma-diff command line is similar to other diffs, but
    here you have to specify the first file as usual, then
    a blockcrc of the second file (instead of the actual file),
    and the 3rd argument is the diff file.

    Code:
    fma-hash file1 file1.crc
    fma-diff file2 file1.crc file1to2.dif
    fma-patch file1 file1to2.dif fileX
    So in this case you can submit a blockcrc of the local file1 to the server
    and get a patch which can turn it into file2, without actually transferring
    either of the (presumably large) files.

  17. #17 by m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    > http://www.pokorra.de/coding/bsdiff.html

    Thanks, but I don't see much difference from original there.
    Well, in case that you missed it, here's my version - http://nishi.dreamhosters.com/u/bsdiff_sh2.rar
    Its not integrated with bzip, but lzma is better for that task anyway.
    Deflate is even better, it's faster.

    Quote Originally Posted by Shelwien View Post
    The only problem with bsdiff is that its totally not a "fast diffing engine".
    Also the size of files which bsdiff can handle is limited by ~210M.
    Indeed. And my workstation doesn't have enough memory to handle 210 MB, IIRC you said that it needs 12N...

    Quote Originally Posted by Shelwien View Post
    Maybe you should consider some LZ with large windows for diffing (lzma, rep/srep).
    Yes, I considered it too and I still want to try it if time permits, except that for in-memory compression I need just about 128k dict, so I'd use a modified LZ4.

    Quote Originally Posted by Shelwien View Post
    > Zlib. And I may add 7-zip and gzip too. Certainly not before October though.

    gzip compression is exactly the same as zlib. Or do you mean gzip headers?
    You're wrong. When I chose the gzip file format, I was sure that the gzip tool uses zlib internally, but it doesn't. It has its own deflate implementation that produces very different files.

    Quote Originally Posted by Shelwien View Post
    As to lzma, I think its not a good idea - there're too much options to bruteforce,
    and what would you use to compress it back in the end? lzma again?
    I also think that LZMA is not good, but for a different reason. I don't know how it parses stuff, but I'm afraid that large window sizes may force using big chunks of data for testing. With deflate I found it's enough to take a 64k piece to determine the compression mode in use. And "what to compress it back with" is the other of my concerns; there's too little to be saved right now.

    Quote Originally Posted by Shelwien View Post
    >> What do you do about stream interleaving in output? (normal data/deflate data)

    > I ask users to stay away from such aberrations. The program only
    > supports a subset of gzip format as input because I was interested
    > in speeding up search and nothing more and this seemed to be the
    > easiest format to implement.

    Well, pdfs,pngs and such are kinda good targets for deflate recompression.
    No doubt. But it's a research tool. I am thinking about turning it into something practical, but I'm not sure I want to. You know, it's boring. When I found out that I have to use brute force, because smart searches (searching for minimal patch size) can't work (on many files the patch size is like a Kronecker delta: either hit or miss, no feedback at all), the current thing turned boring too.

    Quote Originally Posted by Shelwien View Post
    But this questions is still interesting for me, because I still didn't find
    a good solution. Suppose you're writing multiple logical streams to a single
    file, how do you separate them?
    1) Introduce a special marker (signature) and mask it in the data
    2) Store length prefixes for logical blocks
    3) Use a common entropy backend for all streams, where its easy to add control symbols.
    4) ???
    I write to 2 files, one with data, the other with metadata. And when, months ago, I was thinking about zip support, I decided that the best way is to output a single directory, so a coder can sort things and can easily determine which filter to use. The only problems that I see with it are:
    - it needs a multifile archiver. Doesn't matter.
    - portability. What is a valid filename on one filesystem may not be valid on another.


    Quote Originally Posted by Shelwien View Post
    >> So what do you do for testing?
    > All testing that I performed and intend to perform would be too much for a forum post
    > so I hope you'll be more specific.

    Obviously I don't mean specific files which you may try to recompress, but the types of tests.
    For example, this is the test with different deflate encoders:
    http://encode.ru/threads/1184-Precom...ll=1#post25541
    I'm testing with zlib, and did a few tests with gzip and random data.


    Quote Originally Posted by Shelwien View Post
    In other words, usually its more efficient to prepare "exploits" to test a module,
    which are specially designed to generate a bug. For example, if you work with
    32-bit variables, then its quite likely that something might break at a >4G file.

    So the set of tests that your module passes is a good representation of its features,
    in a sense. That's why I asked, anyway.
    Actually, ATM the limit is about 2 GB.
    Like I said, the tool is not meant to be practical yet. So I accept a lot of limitations and just try to implement things in a way that makes them easy to fix later.
    And also, there is no test module.

    Quote Originally Posted by Shelwien View Post
    > So what's the right way of using it?

    There's a demo script in the binary archive (first link)...
    Missed it...
    EDIT: No, I ignored the binary package and went straight to sources.

    Quote Originally Posted by Shelwien View Post
    Anyway, fma-hash generates blockcrc/fingerprint files.
    (Its something like a torrent file with a small block size.)

    Then, fma-diff commandline is similar to other diffs, but
    here you have to specify (as usual) first file, and then
    a blockcrc of the second file (instead of actual file),
    and 3rd argument is the diff file.

    Code:
    fma-hash file1 file1.crc
    fma-diff file2 file1.crc file1to2.dif
    fma-patch file1 file1to2.dif fileX
    So in this case you can submit a blockcrc of local file1 to the server
    and get a patch which can turn it into file2, without actually transferring
    either of (presumably large) files.
    Thanks for the explanation. I tried to use it this way, but there was no reduction even though the files were similar. I'll post the files that I tried shortly; I have to turn on my workstation.
    EDIT: I tried it and it worked. I guess I screwed something up with the hash creation parameters. Nevertheless, the patch was very big and the mismatch boundaries were rough. I tried /x4 /y4 and it helped, but the boundaries were certainly not 4 bytes.
    Last edited by m^2; 16th September 2011 at 09:39.

  18. #18 by m^2 (Member)
    2 comments that you may find useful:
    First, memory requirements. Could you do a reasonably pessimistic estimation?
    It appears to me that they are nowhere near bspatch's, but still cN for diffing with a fairly high c. It would be much better to fit in constant space. And really, I think that even c == 0.1 is quite bad.

    Second, the hash function. I haven't analysed it, but it looks like a regular CRC. CRC is wrong. And even if it's something else, why is anybody supposed to trust that a hash function written by some random dev they never heard of is collision resistant? You should really use some well-known cryptographic hash. Or at the very least write a paper with a detailed analysis of resistance, concluding that collisions are not a problem and won't be for another couple of decades of data growth.

  19. #19 by m^2 (Member)
    Update:
    I replaced bsdiff with LZ4 and performance doubled. I was disappointed at first, but looking deeper it's not bad actually, because I don't even try to make a patch when the compressed output has a totally wrong size - which is 98% of the time over a full search.
    BTW I'm at 3 ms per mode tried, with 1 thread on my slow Pentium D. Which means that without heuristics it still takes about 40 seconds to do a full search over a single version of zlib (12600 trials). Limited to the basic modes - practically 0. Doing a basic search over popular deflate implementations should take about 0 too, so the only real cost would be decompression with verification.
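    For reference, a sketch of this kind of brute-force mode search using only the public zlib API; the parameter grid, the 64k sample, and the byte-prefix similarity measure (standing in for the patch-size criterion above) are all assumptions:
    Code:
    #include <cstring>
    #include <vector>
    #include <zlib.h>

    // Recompress a sample with one zlib parameter combination and return how many
    // of the leading output bytes match the original deflate stream.
    static size_t trial_match(const unsigned char* raw, size_t raw_len,
                              const unsigned char* ref, size_t ref_len,
                              int level, int memLevel, int strategy) {
      z_stream s; std::memset(&s, 0, sizeof s);
      if (deflateInit2(&s, level, Z_DEFLATED, -15 /*raw deflate*/, memLevel, strategy) != Z_OK)
        return 0;
      std::vector<unsigned char> out(deflateBound(&s, raw_len));
      s.next_in  = const_cast<unsigned char*>(raw); s.avail_in  = (uInt)raw_len;
      s.next_out = out.data();                      s.avail_out = (uInt)out.size();
      deflate(&s, Z_FINISH);
      size_t produced = out.size() - s.avail_out;
      deflateEnd(&s);
      size_t n = produced < ref_len ? produced : ref_len, same = 0;
      while (same < n && out[same] == ref[same]) same++;   // crude similarity measure
      return same;
    }

    // Sweep a parameter grid and keep the best-matching combination.
    struct Mode { int level, memLevel, strategy; };
    static Mode detect_mode(const unsigned char* raw, size_t raw_len,
                            const unsigned char* ref, size_t ref_len) {
      Mode best = {6, 8, Z_DEFAULT_STRATEGY}; size_t best_score = 0;
      const int strategies[] = {Z_DEFAULT_STRATEGY, Z_FILTERED, Z_RLE, Z_HUFFMAN_ONLY};
      for (int level = 1; level <= 9; level++)
        for (int mem = 1; mem <= 9; mem++)
          for (int st : strategies) {
            size_t score = trial_match(raw, raw_len, ref, ref_len, level, mem, st);
            if (score > best_score) { best_score = score; best = {level, mem, st}; }
          }
      return best;
    }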
    Last edited by m^2; 16th September 2011 at 14:02.

  20. #20 by Shelwien (Administrator)
    >> Its not integrated with bzip, but lzma is better for that task anyway.
    > Deflate is even better, it's faster.

    Well, compared to deflate, bzip is still better.

    Code:
    idawll1 1326080
    idawll2 1320448
    12.dif  1326804 // bsdiff_sh
    12.zip   121012 // 7z -tzip -mx9
    12.bz2   119266 // 7z -tbzip2 -mx9
    12.7z    114169 // 7z -mx9
    12.lzma  106376 // lzma -fb273 -mc999999999 -lc0 -lp0 -pb0 -mfbt4 -mt1
    
    (1-106376/121012)*100 = ~12%
    >>Maybe you should consider some LZ with large windows for diffing (lzma, rep/srep).
    > Yes, I considered it too and I still want to try it if time permits,
    > except that for in-memory compression I need just about 128k dict,
    > so I'd use a modified LZ4.

    Erm, afaiu you need a window size larger than the first diffed file
    to make it work with a normal LZ.

    >> gzip compression is exactly the same as zlib. Or do you mean gzip headers?
    > You're wrong. When I chose to use gzip file format, I was sure
    > that gzip-tool uses zlib internally, but it doesn't. It has its own
    > deflate implementation that produces very different files.

    Well, I'm not really wrong. Surely, there are some subtle differences,
    but the encoding algos are the same. Afair the main differences are
    that gzip does some analysis to optimize the block segmentation
    (I think the zlib api has a function for this), while zlib may be used
    with fixed-size blocks; also headers, block padding, and crcs.

    Anyway, I didn't really try, but there should be a way to produce
    the same streams with zlib as gzip does.

    > I also think that LZMA is not good, but for a different reason. I
    > don't know how does it parse stuff, but I'm afraid that large window
    > sizes may force using big chunks of data for testing.

    No, my lzmarec also uses chunks of up to 64k for lzma detection.
    Certainly there's less redundancy than in deflate (normally), but
    luckily lzma always allocates (redundant) intervals for out-of-window distances,
    so ~100 bytes of random data are normally enough to detect an error.

    So it's only troublesome when an lzma stream starts with the compressed code of
    some very redundant data. Something like 10M of zeroes.

    Still, currently I think that deflate is probably harder to detect
    (especially the end of a broken stream), because it's relatively easy
    to provide 32k of window data for it, and then any data would look
    like a valid huffman code.
    Thus the detection would mainly be about header validation - hopefully
    there are enough chances to detect an error.
    But then again, deflate also has "stored" blocks, which only require
    18 bits to accidentally match, so misdetection is kinda hard to avoid.

    > You know, it's boring. When I found out that I have to use brute
    > force because smart searches (searching for minimal patch size)
    > can't work because on many files patch size is like Kronecker delta,
    > either hit or miss, no feedback at all, the current thing turned
    > boring too.

    Well, if it's boring, it probably means that you're doing it wrong.
    To me deflate recompression is much harder and more interesting than
    lzma recompression.

    Also, are you really using these diffs only for deflate detection?!
    Actually getting a matching huffman code is pretty unlikely even if
    you'd do bitwise diffs somehow.

    > I write to 2 files, one with data the other with metadata

    Well, you're oversimplifying things for yourself then.
    No wonder it's boring :)

    A real practical codec should be streamable (lzmarec is, btw).
    Also, for a deflate recompressor it's especially important,
    because it really needs recursion support - stuff like .pngs in .docx
    or .pdfs in .zip is not so rare these days.
    With a streamable codec you can just pass the stream through a few
    instances, otherwise things get more complicated (eg. how do you handle
    running out of hdd space when writing temp files?)

    > I tried it and it worked. I guess I screwed something with hash
    > creation parameters. Nevertheless the patch was very big, mismatch
    > boundaries were rough, I tried /x4 /y4 and it helped, but certainly
    > the boundaries were not 4 bytes.

    Well, I thought that you actually needed a diff when I suggested that :)
    Surely it's intended for use with much longer matches - I tested it
    with stuff like diffing mp4 and mkv versions of the same video,
    so it should be able to process large amounts of data fast enough.
    Well, it's what I'd call a "fast diffing engine", but your case
    is completely different, I guess.

    Still, now I'd suggest looking at http://encode.ru/threads/591-Unalign...ing-experiment
    which is based on the same hashing algo.

    > First, memreqs. [...] It would be much better to fit in constant space

    Obviously it can be implemented with a fixed-size hashtable, though there'd
    be more missed matches.

    > Second, hash function. I haven't analysed it, but it looks like a
    > regular CRC. CRC is wrong.

    Yes, it's a regular crc32, which is good, because it's random enough
    (especially compared to LCG hashes and stuff like adler32),
    easy to work with (there's even a cpu instruction now),
    and you need it in any archiver anyway.

    Just think about it a little.
    Hash functions (used for hashtables and such) and cryptographic hashes
    have completely different applications.
    Hash function collisions are expected, and the function should be fast.

    Also, good rolling hashes are even harder to find, and I'm really
    happy that I managed to implement a rolling crc32, because it's
    much more random than the stuff that Bulat uses in rep/srep.

    > And even if it's something else, why is anybody supposed to trust
    > that a hash function written by some random dev they never heard of
    > is collision resistant?

    A hash function isn't expected to be collision resistant.

    > You should really use some well known cryptographic hash.

    No, because the collision stats for crc32 and for 32 bits of sha1 won't be
    really different, unlike their performance.

    > Or at very least write a paper with detailed analysis of resistance
    > with conclusion that collisions are not a problem and won't be for
    > another couple of decades of data growth.

    It's a remote diff, so there's no way to really check whether it's a
    real match or a collision.
    Of course, normally you have to check all matches.
    That "fma" thing is just a method to find relatively long matches really fast,
    without slow memory lookups per data byte.

  21. #21 by m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    >> Its not integrated with bzip, but lzma is better for that task anyway.
    > Deflate is even better, it's faster.

    Well, comparing to deflate, bzip is still better.

    Code:
    idawll1 1326080
    idawll2 1320448
    12.dif  1326804 // bsdiff_sh
    12.zip   121012 // 7z -tzip -mx9
    12.bz2   119266 // 7z -tbzip2 -mx9
    12.7z    114169 // 7z -mx9
    12.lzma  106376 // lzma -fb273 -mc999999999 -lc0 -lp0 -pb0 -mfbt4 -mt1
    
    (1-106376/121012)*100 = ~12%
    I lost the data, but with zlib -6 the program was about 5% faster than with bzip2. No easily measurable speed difference between -6 and -1.

    Quote Originally Posted by Shelwien View Post
    >>Maybe you should consider some LZ with large windows for diffing (lzma, rep/srep).
    > Yes, I considered it too and I still want to try it if time permits,
    > except that for in-memory compression I need just about 128k dict,
    > so I'd use a modified LZ4.

    Erm, afaiu you need a window size larger than first diffed file
    to make it work with normal LZ.
    I wanted to use a diff engine in 2 situations:
    - To determine whether the compression mode that I'm trying is good; I minimize the size of the patch
    - To use it in the output when I can find a match that's *almost* perfect

    During determination, I use only small (64k) samples from the file, so a 128k window is almost enough to fit both (almost, because I need to leave zlib some headroom, 1k).

    I decided against using it in the output for now because I'm yet to find a single case with an *almost* perfect match. All that I've seen are either perfect or very bad, so my tool fails when nothing exact can be found. I remember that Schnaader said that such matches happen, but I don't want to implement it until I see it.

    Quote Originally Posted by Shelwien View Post
    >> gzip compression is exactly the same as zlib. Or do you mean gzip headers?
    > You're wrong. When I chose to use gzip file format, I was sure
    > that gzip-tool uses zlib internally, but it doesn't. It has its own
    > deflate implementation that produces very different files.

    Well, I'm not really wrong. Surely, there're some subtle differences,
    but the encoding algos are the same. Afair the main differences are
    that gzip does some analysis to optimize the block segmentation
    (I think zlib api has a function for this), while zlib may be used
    with fixed-size blocks; also headers, block padding, and crcs.

    Anyway, I didn't really try, but there should be a way to produce
    with zlib the same streams as gzip does.

    deflate.c from zlib and gzip 1.2.4: white is shared code, grey is no code, and yellow is different code. Many differences are stuff like asserts and comments, but there are numerous real ones too. And I'm yet to find a file > 100 KB that would compress the same with both, even though gzip always seems to use 1 block in such cases. In fact, I'm yet to find a case where using zlib to recompress gzip would be worthwhile, though I can't say I tried hard. You know, there are similarities; you break even if you compress book1 to about 80k. I tried all explicit zlib parameters and some implicit ones (like feeding it differently sized chunks of data) and it just didn't work.

    Quote Originally Posted by Shelwien View Post
    > I also think that LZMA is not good, but for a different reason. I
    > don't know how does it parse stuff, but I'm afraid that large window
    > sizes may force using big chunks of data for testing.

    No, my lzmarec also uses chunks of up to 64k for lzma detection.
    You mean detecting whether some piece of data is an lzma stream? I don't do this; I assume that the data mostly follows the gzip spec. I use 64k chunks to detect the compression mode.

    Quote Originally Posted by Shelwien View Post
    > You know, it's boring. When I found out that I have to use brute
    > force because smart searches (searching for minimal patch size)
    > can't work because on many files patch size is like Kronecker delta,
    > either hit or miss, no feedback at all, the current thing turned
    > boring too.

    Well, if its boring, it probably means that you're doing it wrong.
    To me deflate recompression is much harder and interesting than
    lzma recompression.
    I don't feel capable of doing it your way. And mine looked promising at first, but later turned out to be not so good.

    Quote Originally Posted by Shelwien View Post
    Also are you really using these diffs only for deflate detection?!
    Actually getting matching huffman code is pretty unlikely even if
    you'd do bitwise diffs somehow.
    I mostly answered this already, but I'll add that indeed, in the detection phase I never find perfect matches; it's just that the correct one stands out as the most accurate. Sometimes the patch size is like 50 KB, sometimes 20, but over a 64k sample the correct mode always seems to be the minimum. I'm not entirely sure that 64k is the perfect size yet, but I am sure that, in general, the trick works.

    Quote Originally Posted by Shelwien View Post
    > I write to 2 files, one with data the other with metadata

    Well, you're oversimplifying things for youself then.
    No wonder its boring
    Nah. Implementing another file format would be extremely boring to me.

    Quote Originally Posted by Shelwien View Post
    A real practical codec should be streamable (lzmarec is, btw).
    Also for deflate recompressor its especially important,
    because it really needs recursion support - stuff like .pngs in .docx
    or .pdfs in .zip is not so rare these days.
    With a streamable codec you can just pass the stream through a few
    instances, otherwise things get more complicated (eg. how do you handle
    out-of-hdd-space when writing temp files?)
    I've been thinking about it and came up with a processing chain where any element would take and return a set of streams, not just one. 7-zip has something along these lines, but mine would be simpler, I think. It doesn't work with OS pipes unless there's some (de)multiplexer, but inside an archiver it would be OK.
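    A minimal sketch of that interface idea (names and the batch, in-memory style are assumptions):
    Code:
    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // One logical stream of bytes; a chain element consumes and produces sets of these.
    using Stream = std::vector<uint8_t>;
    using StreamSet = std::map<std::string, Stream>;   // named streams, e.g. "data", "meta"

    // A chain element takes a set of streams and returns a (possibly different) set.
    struct CodecElement {
      virtual ~CodecElement() {}
      virtual StreamSet process(const StreamSet& in) = 0;
    };

    // Chaining is then just function composition over stream sets.
    struct Chain {
      std::vector<std::unique_ptr<CodecElement>> elements;
      StreamSet run(StreamSet s) const {
        for (const auto& e : elements) s = e->process(s);
        return s;
      }
    };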

    Quote Originally Posted by Shelwien View Post
    > I tried it and it worked. I guess I screwed something with hash
    > creation parameters. Nevertheless the patch was very big, mismatch
    > boundaries were rough, I tried /x4 /y4 and it helped, but certainly
    > the boundaries were not 4 bytes.

    Well, I thought that you actually need a diff when I suggested that
    Surely its intended for use with much longer matches - I tested it
    with stuff like diffing mp4 and mkv versions of the same video,
    so it should be able to process large amounts of data fast enough.
    Well, its what'd I'd call a "fast diffing engine", but your case
    is completely different, I guess.
    Yeah.

    Quote Originally Posted by Shelwien View Post
    Still, now I'd suggest to look at http://encode.ru/threads/591-Unalign...ing-experiment
    which is based on the same hashing algo.
    Thanks, I think I'll look into it. But not before October; I've finished coding for my thesis and I'll be busy with the write-up.

    Quote Originally Posted by Shelwien View Post
    > First, memreqs. [...] It would be much better to fit in constant space

    Obviously it can be implemented with a fixed-size hashtable, though there'd
    be more missed matches.
    I think that a sliding window with configurable size would work about as well.

    Quote Originally Posted by Shelwien View Post
    > Second, hash function. I haven't analysed it, but it looks like a
    > regular CRC. CRC is wrong.

    Yes, its a regular crc32 which is good, because its random enough
    (especially comparing to LCG hashes and stuff like adler32)
    and easy to work with (there's even a cpu instruction now),
    and you need it in any archiver anyway.

    Just think about it a little.
    Hash functions (used for hashtables and such) and cryptographic hashes
    have completely different applications.
    Hash function collisions are expected, and it should be fast.
    It makes your algo unreliable, and what's worse, with your usage scenario there's no easy way of testing patches. The only real solution would be to checksum them, but that's kinda backwards.
    Anyway, using it without testing each patch is asking for disaster.

    Quote Originally Posted by Shelwien View Post
    Also good rolling hashes are even harder to find, and I'm really
    happy that I managed to implement a rolling crc32, because its
    much more random than stuff that Bulat uses in rep/srep.
    I'm pretty sure srep uses MD5.

    Quote Originally Posted by Shelwien View Post
    > And even if it's something else, why is anybody supposed to trust
    > that a hash function written by some random dev they never heard of
    > is collision resistant?

    Hash function isn't expected to be collision resistant.

    > You should really use some well known cryptographic hash.

    No, because collision stats for crc32 and 32-bits of sha1 won't be
    really different, unlike their performance.
    Search Charles Bloom's blog; he tested various hash functions and they did have a significant number of collisions. Besides, I didn't say that you should use a 32-bit hash function.
    I trust the SHA-256 used by ZFS, but CRC32? No way.

    Quote Originally Posted by Shelwien View Post
    > Or at very least write a paper with detailed analysis of resistance
    > with conclusion that collisions are not a problem and won't be for
    > another couple of decades of data growth.

    Its a remote diff, so there's no way to really check whether its a
    real match or collision.
    Of course, normally you have to check all matches.
    And that's why I say the algo should be designed to avoid them.

  22. #22 by Bulat Ziganshin (Programmer)
    rolling crc32, because its much more random than stuff that Bulat uses in rep/srep.
    What do you mean here? The LCG hash used in srep has a very low number of collisions (afair it was 42 or so for a multi-gigabyte file). However, I didn't check the evenness of the hash distribution. Also, it would be interesting to compare speeds.

  23. #23 by Shelwien (Administrator)
    > I lost the data, but with zlib -6 the program was about 5% faster than with bzip2

    Well, originally I thought that you're somehow using deflate recompression and diffing
    in conjunction. Like diffing .tar.gz files with app version sources (or diffing pdf
    versions, but you said before that you don't handle pdfs).
    But for comparison of original and reencoded data you obviously don't need anything like that.

    > I wanted to use a diff engine in 2 situations:
    > -To determine whether the compression mode that I'm trying is good, I minimize size of patch
    > -To use it in the output when I can find a match that's *almost* perfect.

    1) deflate is not byte-aligned, and by using bytewise compression or diffing methods
    you're likely to miss a lot.
    2) It's really not that simple in most cases. Even a small difference
    (eg. a single 3-byte match encoded as a match in one version and as 3 literals in the other)
    would change the block's statistics, which would affect the generated huffman code,
    so it's very likely that most of the block would be different.
    3) For a unique task you need a unique solution. "Universal" methods certainly
    won't be efficient in this case. For example, the diffing would work much better
    if you applied it to raw deflate tokens, without the huffman code.
    At least, it would be a good idea to parse the deflate blocks in your own code -
    i.e. after finding a seemingly valid 10M deflate stream in the data, you shouldn't
    try to reencode the whole stream, but rather the blocks of that stream separately (see the sketch below).
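    As an aside, zlib itself can walk a deflate stream block by block via the Z_BLOCK flush mode; a rough sketch, with the whole input assumed to be in memory and the decompressed output discarded:
    Code:
    #include <cstdio>
    #include <cstring>
    #include <vector>
    #include <zlib.h>

    // Decode a raw deflate stream and report block boundaries using inflate(..., Z_BLOCK).
    // After each return, strm.data_type has bit 128 set right after an end-of-block code
    // and bit 64 set while the last block is being decoded (see the zlib manual).
    static void list_deflate_blocks(const unsigned char* in, size_t in_len) {
      z_stream s; std::memset(&s, 0, sizeof s);
      if (inflateInit2(&s, -15 /* raw deflate, no zlib/gzip header */) != Z_OK) return;
      std::vector<unsigned char> out(1 << 16);
      s.next_in  = const_cast<unsigned char*>(in);
      s.avail_in = (uInt)in_len;
      size_t consumed_prev = 0;
      int ret;
      do {
        s.next_out  = out.data();                 // output is overwritten and ignored
        s.avail_out = (uInt)out.size();
        ret = inflate(&s, Z_BLOCK);               // stop at block boundaries
        if (ret != Z_OK && ret != Z_STREAM_END) break;
        if (s.data_type & 128) {                  // just passed an end-of-block code
          size_t consumed = in_len - s.avail_in;
          std::printf("block ends near input byte %zu (%zu compressed bytes)\n",
                      consumed, consumed - consumed_prev);
          consumed_prev = consumed;
        }
      } while (ret != Z_STREAM_END && (s.avail_in > 0 || s.avail_out == 0));
      inflateEnd(&s);
    }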

    > I remember that Schnaader said that such matches happen, but I don't
    > want to implement it until I see it.

    In practical cases it can be stuff like crcs at the end of the stream,
    or maybe some garbage at the end of a broken stream.
    But for bytewise diffing to work on deflate streams, I don't see
    much chance, except that an example could be constructed artificially.

    > deflate.c from zlib and gzip 1.2.4.

    Well, I only know that the actual match sequence is the same in gzip and zlib
    output, though the block sizes may be different.
    There's also a "minigzip" example in zlib; as it is, it creates ~2x smaller
    blocks than gzip, but after patching "s->lit_bufsize = 1 << (memLevel + 6);"
    into "+7", its output became exactly the same as gzip's.

    > You mean, detection whether some piece of data is a lzma stream?

    Yes, without the option byte (the lc/lp/pb settings) it's impossible to decode
    an lzma stream, so it has to be bruteforced.
    But I don't detect the other encoding parameters (fb, mc etc.) at all -
    there are too many combinations anyway.

    > I don't do this, I assume that data mostly follows gzip spec.
    > I use 64k chunks to detect compression mode.

    Actually, if you used your own decoder (not zlib) there'd be an option
    to directly detect (at least some of) the parameters, by analyzing the values
    of distances (max distance value, max number of hash chain steps to find that distance).

    However, using only a 64k chunk at the start for parameter detection is not 100% safe.
    I think it'd be pretty likely (with some not very redundant types of data)
    that deflate output on eg. levels 8 and 9 would be exactly the same in the first 64k,
    but different later.

    > Implementing another file format would be extremely boring to me.

    Normally I wouldn't call a problem which I can't solve "boring" :)
    (I mean something like the stream interleaving which I mentioned)

    > It makes your algo unreliable and what's worse - with your usage
    > scenario there's no easy way of testing patches. The only real
    > solution would be to checksum them, but that's kinda backwards.
    > Anyway, using it w/out testing of each patch is asking for disaster.

    Well, that specific fma-diff is certainly unsafe.
    But when patching a remote file, it's necessary to check its hash anyway
    (it may just as well be broken for a reason other than an accidental
    crc collision), and a different method can be used in case of a mismatch
    (which would be rare anyway).

    However, I actually have another idea for this - it's possible to add
    some error recovery data to the archive, which would also fix an
    accidental collision. Error recovery is useful anyway, and it seems
    fairly attractive to be able to skip actual data matching.

    > I'm pretty sure srep uses MD5.

    Code:
    -m1: check matches by SHA1 digest (compression memory = 6-7% of filesize)
    -m2: check matches by rereading old data (compression memory = 2-3% of filesize)
    -m3: byte-accurate matches and rereading (compression memory = 4-6% of filesize)
    Not for matchfinding and not in all modes.
    Also note the memory usage.

    > Besides, I didn't say that you should use a 32-bit hash function.

    Well, I've recently been using 64-bit hashes combined from two crc32s.
    I haven't seen any "natural" collisions yet.
    Btw, I extended it after finding an fma-hash collision in enwik9,
    but with such a file size (1G) there's a high probability that
    any 32-bit hash would generate a collision.
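    For a rough sense of scale, assuming fragments average about 1 KB: enwik9 yields on the order of n = 10^6 fragment hashes, and the birthday approximation puts the expected number of colliding pairs among n random 32-bit values at about n^2/2^33, i.e. roughly a hundred, so at least one crc32 collision in a 1 GB file is practically guaranteed. With a 64-bit hash the same estimate drops to n^2/2^65, around 3e-8.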

    > I trust SHA-256 used by ZFS, but CRC32? No way.

    Well, trusting that stuff is not the best idea.
    It can still be forged with appropriate hardware, or just accidentally match.

    Also, surely I'd like to use sha1 or even md5 as the hash function,
    but it just doesn't make sense, because of speed and memory considerations.

    > And that's why I say the algo should be designed to avoid them.

    You can't avoid them. In the set of all N-bit files, 2^(N-hashsize) of them
    would produce the same given hash value, so for N>hashsize you're never
    completely safe without comparing the actual data.
    On the other hand, a combination of 4 crc32s would have roughly the same
    probability of an _accidental_ collision as md5, while being much faster to compute.

    Anyway, common hash functions are usually even less random than crc32, and
    not necessarily faster either.

  24. #24 by Shelwien (Administrator)
    >> rolling crc32, because its much more random than stuff that Bulat uses in rep/srep.

    > what you mean here? LCG hash used in srep has very low number of
    > collisions (afair it was 42 or so for multigigabyte file).

    Well, my use case is different (fragment splitting on hash<limit),
    so the "quality" of the hash function might be more important.

    > however i don't checked evenness of hash distribution. also it would be
    > interesting to compare speeds

    With rolling crc32 it looks kinda like this:
    Code:
      uint CRC32Update( uint x, byte c ) { 
        return CRCTab[byte(x)^c] ^ (x>>8); 
      }
    
      void Update( byte c, byte d ) {
        x = CRC32Update(x,c) ^ IncWinTab[d]; // rolling crc update
        y = CRC32Update(y,c); // normal crc update
        z = CRC32Update(z,c+1); // alt. crc update
        blklen++;
      }
    
      int Check( void ) {
        int flag = (blklen>=cfg_minblklen) && ((uint(x)<uint(cfg_anchormask)) | (blklen>=cfg_maxblklen));
        if( flag ) Flush();
        return flag;
      }
    
      while(1) {
        byte c = buf[inp_j++];
        inpwin[inp_i&inpwinM] = c;
        crc.Update( c, inpwin[(inp_i-winsize)&inpwinM] ); 
        inp_i++;
        if( crc.Check() ) hout.Store( g, crc.h );
      }
    Taking into account that hash lookups are only done when Check() returns true,
    it might well be faster than the matchfinding in rep/srep.

  25. #25 by m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    > I lost the data, but with zlib -6 the program was about 5% faster than with bzip2

    Well, originally I thought that you're somehow using deflate recompression and diffing
    in conjuction. Like diffing .tar.gz files with app version sources (or diffing pdf
    versions, but you said before that you don't handle pdfs).
    But for comparison of original and reencoded data you obviously don't need anything like that.
    It was supposed to let me skip checking almost all of the data with some smart search algorithm. It didn't work, and later I also tried to remove this distance checking and use a regular comparison, and failed. I know there is a bug, and I think that yesterday, during a walk, I found it. I'm not returning to the code for a while though, no time. Still, diffing leaves it open to introduce patches later.

    Quote Originally Posted by Shelwien View Post
    > I wanted to use a diff engine in 2 situations:
    > -To determine whether the compression mode that I'm trying is good, I minimize size of patch
    > -To use it in the output when I can find a match that's *almost* perfect.

    1) deflate is not byte-aligned, and by using bytewise compression or diffing methods
    you're likely to miss a lot
    I know. That's why I'm seeing even 50k differences. But being time-constrained, I was looking for the best bang-for-the-buck solutions, and developing such tools was out of the question. And I'm quite satisfied with how it turned out, even though it could have been better and I still have many improvements to make.
    Quote Originally Posted by Shelwien View Post
    2) Its really not that simple in most cases. Even a small difference
    (eg. single 3-byte match encoded as match in one version and as 3 literals in other)
    would change block's statistics, which would affect the generated huffman code,
    so its very likely that most of the block would be different.
    3) For a unique task you need a unique solution. "Universal" methods certainly
    won't be efficient in this case. For example, the diffing would work much better
    if you'd apply it to raw deflate tokens, without the huffman code.
    At least, it would be a good idea to parse the deflate blocks in your own code -
    ie after finding a seemingly valid 10M deflate stream in the data, you shouldn't
    try to reencode the whole stream, but the blocks of that stream separately.
    Noted. I think it's too much effort to be worth it (especially since I intend to remove diffing entirely and replace it with comparison).

    Quote Originally Posted by Shelwien View Post
    > I remember that Schnaader said that such matches happen, but I don't
    > want to implement it until I see it.

    In practical cases it can be stuff like crcs at the end of stream,
    or maybe some garbage at the end of a broken stream.
    But for bytewise diffing to work on deflate streams, I don't see
    much chances, except that an example can be constructed artifically.
    Mhm

    Quote Originally Posted by Shelwien View Post
    > deflate.c from zlib and gzip 1.2.4.

    Well, I only know that the actual match sequence is the same in gzip and zlib
    output, though block sizes may be different.
    There's also a "minigzip" example in zlib, as it is, it creates ~2x smaller
    blocks than gzip, but after patching "s->lit_bufsize = 1 << (memLevel + 6);"
    into "+7" its output became exactly the same as gzip.
    I'll look into it later, though I'm sceptical.

    Quote Originally Posted by Shelwien View Post
    > I don't do this, I assume that data mostly follows gzip spec.
    > I use 64k chunks to detect compression mode.

    Actually if you'd use your own decoder (not zlib) there'd be an option
    to directly detect (at least some of) the parameters, by analyzing values
    of distances (max distance value, max number of hash chain steps to find that distance).
    That's a cool idea. Nevertheless, I have far too little data to judge whether it's really useful. How many deflated files use an artificially reduced window size? I think hardly any, therefore a heuristic for what is unnecessary would be a simpler and faster solution. But I haven't even started to work on heuristics; I first have to release a testable version and gather some data on real-world performance.

    Quote Originally Posted by Shelwien View Post
    However using only a 64k chunk at start for parameter detection is not 100% safe.
    I think its be pretty likely (with some not very redundant types of data)
    that deflate output on eg. levels 8 and 9 would be exactly the same in the first 64k,
    but different later.
    I've been thinking about misdetections and have 2 solutions (or actually 2 variants of the same solution) for them:
    - increase the buffer size until there's only 1 mode left
    - remember a couple of the best modes and, during verification, use them all (in separate threads)
    I was even close to implementing it, but decided again to skip it until I have evidence of usefulness.
    BTW, the longest match over a misdetection that I've encountered to date was 12k; I can't claim that I did a careful evaluation though.

    Quote Originally Posted by Shelwien View Post
    > Implementing another file format would be extremely boring to me.

    Normally I won't call a problem which I can't solve "boring"
    (I mean something like stream interleaving which I mentioned)
    I sometimes do. I just don't feel it's my business, and while I don't really know why that is (after all, it's close to what I do), it lies beyond my field of interest, so it doesn't really matter whether it's hard or not.

    Quote Originally Posted by Shelwien View Post
    > It makes your algo unreliable and what's worse - with your usage
    > scenario there's no easy way of testing patches. The only real
    > solution would be to checksum them, but that's kinda backwards.
    > Anyway, using it w/out testing of each patch is asking for disaster.

    Well, that specific fma-diff is certainly unsafe.
    But when patching a remote file, its necessary to check its hash anyway
    (it may be as well broken due to a different reason than accidental
    crc collision), and a different method can be used in case of mismatch
    (which would be rare anyway).

    However, actually I have another idea for this - its possible to add
    some error recovery data to the archive, which would as well fix an
    accidental collision. Error recovery is useful anyway, and it seems
    fairly attractive to be able to skip actual data matching.
    But then you need another way to produce a patch, and overall it's an extremely complicated and hardly high-performance way of solving an easy problem.

    Quote Originally Posted by Shelwien View Post
    > I'm pretty sure srep uses MD5.

    Code:
    -m1: check matches by SHA1 digest (compression memory = 6-7% of filesize)
    -m2: check matches by rereading old data (compression memory = 2-3% of filesize)
    -m3: byte-accurate matches and rereading (compression memory = 4-6% of filesize)
    Not for matchfinding and not in all modes.
    Also note the memory usage.
    Surprise. I knew something was wrong, so I checked, and it indeed uses MD5... just for a different purpose: hashing the entire data block. Anyway, SHA-1 is not bad either.

    Quote Originally Posted by Shelwien View Post
    > Besides, I didn't say that you should use a 32-bit hash function.

    Well, I recently use 64-bit hashes combined from two crc32s.
    Didn't see any "natural" collisions yet.
    Btw, I extended it after finding a fma-hash collision in enwik9,
    but with such a file size (1G) there's a high probability that
    any 32-bit hash would generate a collision.
    Seriously, check some hash function comparisons; I'm yet to meet a fast one that has collision resistance close to the theoretical limit.

    Quote Originally Posted by Shelwien View Post
    > I trust SHA-256 used by ZFS, but CRC32? No way.

    Well, trusting that stuff is not the best idea.
    It still can be forged with appropriate hardware, or just accidentally match.
    Initially, when I was reading about ZFS dedupe, I was disappointed that verifying matches was optional and that by default it relied purely on hash matches. I was sure that if I used it (on my current data I don't think it's worth it), I would make it check that the data is intact; the performance penalty seemed worthwhile. But later I read a post from some ZFS team member who pointed out that the hardware error rate on the "best computers money can buy" was, IIRC, over 30 orders of magnitude higher than that of SHA-256. I laughed at how superstitious I had been and learned that there are cases when designing strictly correct algorithms simply makes no sense.

    Quote Originally Posted by Shelwien View Post
    Also surely I'd like to use sha1 or even md5 as a hash function,
    but it just doesn't make sense, because of speed and memory considerations.
    For me it's clear that the current solution is bad, and the simplest way of fixing it would be to use a better hash.

    Quote Originally Posted by Shelwien View Post
    > And that's why I say the algo should be designed to avoid them.

    You can't avoid them. In a set of all N-bit files, 2^(N-hashsize) of them
    would produce the same given hash value, so for N>hashsize you're never
    completely safe without comparing the actual data.
    Sure, there are limits, but like I said above, at some point they don't matter.

    Quote Originally Posted by Shelwien View Post
    On other hand, a combination of 4 crc32s would have roughly the same
    probability of _accidental_ collision as md5, while being much faster to compute.

    Anyway, common hash functions are usually even less random than crc32, and
    not necessarily faster either.
    I've seen several comparisons of fast hash functions, and none had a collision rate even close to the theoretical limit. So it's not as good as you say. It is quite possible that CRC has a better collision-resistance-versus-speed curve than a cryptographic hash, but validating that properly would be much less straightforward than saying "it's good".
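
    For scale, under the idealized assumption that a combined hash behaves like a random k-bit function, the birthday bound for n hashed blocks gives:
    Code:
    P(\text{collision}) \approx \frac{n(n-1)}{2^{k+1}} \approx \frac{n^2}{2^{k+1}}
    So four concatenated crc32s (k = 128, the same width as MD5) over n = 2^30 blocks would give roughly 2^-69 - but only if the four CRCs behave close to independently on real data, and how far real fast hashes fall short of that idealized bound is exactly what those comparisons try to measure.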

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    >> There's also a "minigzip" example in zlib
    > I'll look into it later, though I'm sceptical.

    Here's a demo:
    http://nishi.dreamhosters.com/u/minigzip_v0.rar

    minigzip built with zlib 1.2.5, modified deflate.c is included.

    > How many deflated files use artificially reduced window size?

    I guess the main reason to do that is a little different.
    There's a funny heuristic in zlib that discards 3-byte matches
    if the distance is >4k, while pkzip (aka securezip) further limits it to 256.
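
    A self-contained sketch of the idea behind that heuristic (the constant is TOO_FAR = 4096 in zlib's deflate.c, from memory; this is an illustration, not zlib's exact code):
    Code:
    #include <stddef.h>

    /* A match of the minimum length 3 is rejected when it is too far back,
     * because a (length, distance) pair with a large distance usually codes
     * to more bits than three literals would. */
    #define MIN_MATCH 3
    #define TOO_FAR   4096

    static int keep_match(size_t match_length, size_t match_distance)
    {
        if (match_length == MIN_MATCH && match_distance > TOO_FAR)
            return 0;   /* emit literals instead */
        return 1;
    }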

    > But then you need another way to produce a patch and overall it's
    > extremely complicated and hardly high-performance way of solving an
    > easy problem.

    Compared to xdelta or bsdiff, it's a very simple and fast way to patch big files.

    Also, what do you call an "easy problem" here?
    Referencing a possible match 20GB back in a stream?
    I still think that adding some ECC data to the compressed stream would be a good
    workaround for possible collisions, instead of having to write a temp file
    and read from it to check uncached matches (like Bulat does).

    > Seriously, check some hash function comparisons, I'm yet to meet a
    > fast one that would have collision resistance close to theoretical limit.

    Why should I care about somebody's comparisons when I can test them myself
    in my specific algorithms?
    Anyway, atm I only know two kinds of rolling hashes - LCG and crc,
    and among these, crc looks better to me.
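
    For illustration, the LCG-style (polynomial / Rabin-Karp) rolling hash over a fixed window looks roughly like this; the window length and multiplier are arbitrary choices for the sketch:
    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Rabin-Karp-style ("LCG") rolling hash over a fixed window. */
    #define WIN  48            /* window length, arbitrary */
    #define MULT 0x9E3779B1u   /* odd multiplier, arbitrary */

    /* hash of buf[0..WIN) = sum buf[j] * MULT^(WIN-1-j)  (mod 2^32) */
    static uint32_t roll_init(const uint8_t *buf)
    {
        uint32_t h = 0;
        for (size_t j = 0; j < WIN; j++)
            h = h * MULT + buf[j];
        return h;
    }

    /* MULT^(WIN-1) mod 2^32, precomputed once. */
    static uint32_t roll_outfactor(void)
    {
        uint32_t m = 1;
        for (size_t j = 0; j + 1 < WIN; j++)
            m *= MULT;
        return m;
    }

    /* Slide the window one byte: drop 'out', append 'in'. */
    static uint32_t roll_step(uint32_t h, uint8_t out, uint8_t in,
                              uint32_t outfactor)
    {
        return (h - (uint32_t)out * outfactor) * MULT + in;
    }
    A rolling crc can be maintained the same way: since crc is linear, the contribution of the outgoing byte can be cancelled with one extra table lookup per step.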

    And what's the "theoretical limit of collision resistance"?
    I'm pretty sure that if I enumerated all files of length N,
    exactly 256^N/2^32 of them would map to each crc32 value, and
    you can't do better with a 32-bit hash.
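
    In fact the count is exact for fixed-length inputs: crc32 is an affine map over GF(2) from the 8N input bits to 32 output bits and is surjective once N >= 4 bytes, so every output value has the same number of preimages:
    Code:
    \#\{\, x \in \{0,1\}^{8N} : \mathrm{crc32}(x) = c \,\} \;=\; \frac{256^N}{2^{32}} \;=\; 2^{8N-32} \qquad \text{for every } c,\ N \ge 4
    That uniform count assumes all inputs are equally likely, though; on real data the question is how far the empirical distribution of hash values drifts from uniform, which is what empirical comparisons measure.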

    > Initially when I was reading about ZFS dedupe [...]

    Actually, in the case of detecting duplicate clusters (of size 4 KB or more),
    I'd probably use a stronger hash myself, because the data-to-hash size ratio
    allows it (though then again, I'd bet that ZFS can't detect unaligned
    duplicate blocks, if it works like that).

    As to hardware errors, though, I don't exactly agree, because it's much
    more likely to get a hardware error in a SHA-1-based dedup than in one based on crc32,
    just because computing SHA-1 puts significantly more load on the CPU.

    > I've seen several comparisons of fast hash functions and none was
    > collision rate even close to theoretical limit

    Well, in Bloom's post which you (probably) mentioned, crc32 looks good:
    http://cbloomrants.blogspot.com/2010...0-adler32.html

  27. #27
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    m^2, I recommend you start by reading http://samba.org/~tridge/phd_thesis.pdf - both a fast rolling hash and SHA-1 are required for the deduplication algorithm. But I believe that any reasonable algorithm, including FCG, CRC and SHA, would be pretty close in collision detection - the more important factor is the number of bits in the hash. srep computes a 64-bit FCG hash, which should be enough to avoid most collisions, but I've seen some (10-100) collisions on a multigigabyte file, and AFAIR it was a problem with handling zeroes. Nevertheless, I use SHA-1 because it's fast enough, is computed in a separate thread, and 100% ensures data integrity. But if you have a fallback solution (i.e. a full download in case the deduplicated data fails a CRC check), a simple 32-bit hash would be enough.

    Eugene, actually I wonder why you need the fast hash. IMHO, if you need to update 10-100 MB of software to the latest version, you should download, say, 64-bit hashes of 4 KB blocks of the new version, look for the same hashes in the already installed version, and then directly download the missing blocks. You could build a full MSI installer this way, for example, and then run it to update the program.
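
    A minimal sketch of that scheme, assuming block-aligned matches and a hypothetical 64-bit hash64() helper (a real rsync-style updater would use a rolling weak hash so matches need not be aligned, plus a strong hash to verify them):
    Code:
    #include <stdint.h>
    #include <stdio.h>

    #define BLK 4096

    /* hash64() is a hypothetical helper, e.g. the two-CRC sketch earlier. */
    uint64_t hash64(const void *buf, size_t len);

    /* Decide which blocks of the new version must be downloaded:
       hashes_new[] holds the 64-bit hashes of the new version's blocks
       (fetched from the server); old_file is the installed version.
       A block is skipped if some aligned block of the old file matches. */
    void plan_download(const uint64_t *hashes_new, size_t nblocks_new,
                       FILE *old_file)
    {
        enum { MAXOLD = 1 << 16 };        /* toy "hash set": linear array */
        static uint64_t old_hashes[MAXOLD];
        size_t nold = 0;
        uint8_t buf[BLK];

        while (nold < MAXOLD && fread(buf, 1, BLK, old_file) == BLK)
            old_hashes[nold++] = hash64(buf, BLK);

        for (size_t i = 0; i < nblocks_new; i++) {
            int found = 0;
            for (size_t j = 0; j < nold; j++)
                if (old_hashes[j] == hashes_new[i]) { found = 1; break; }
            printf("block %zu: %s\n", i, found ? "copy locally" : "download");
        }
    }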

    ... back to hashes: we should measure the number of collisions on REAL data and compare it to the theoretical limit. For measuring efficiency (as opposed to data integrity) on 32+ byte blocks I don't expect any real problem with CRC or murmur; even my FCG is good enough with those 10-100 real collisions per gigabyte.
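
    A rough sketch of such a measurement: count block pairs with equal hash but different content and compare the result against the birthday expectation of about n^2/2^33 for a 32-bit hash; weak_hash() stands for whichever function is under test:
    Code:
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLK 4096

    /* weak_hash() is whatever 32-bit hash is being evaluated. */
    uint32_t weak_hash(const void *buf, size_t len);

    typedef struct { uint32_t h; size_t off; } entry;

    static int by_hash(const void *a, const void *b)
    {
        uint32_t x = ((const entry *)a)->h, y = ((const entry *)b)->h;
        return x < y ? -1 : x > y;
    }

    /* Count adjacent (after sorting by hash) blocks with equal hash but
       different content.  'data'/'size' is the whole test file in memory,
       which is good enough for a sketch; adjacent pairs slightly undercount
       when 3+ blocks share a hash. */
    size_t count_true_collisions(const uint8_t *data, size_t size)
    {
        size_t n = size / BLK, collisions = 0;
        entry *e = malloc(n * sizeof *e);
        if (!e) return 0;
        for (size_t i = 0; i < n; i++) {
            e[i].h = weak_hash(data + i * BLK, BLK);
            e[i].off = i * BLK;
        }
        qsort(e, n, sizeof *e, by_hash);
        for (size_t i = 1; i < n; i++)
            if (e[i].h == e[i - 1].h &&
                memcmp(data + e[i].off, data + e[i - 1].off, BLK) != 0)
                collisions++;
        free(e);
        return collisions;   /* compare against n*n / 2^33 */
    }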

  28. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Nah, I used bsdiff in the updater anyway; there's no point in using a remote diff when I have both versions locally.
    But I'm planning to apply anchored hashing as a rep equivalent in my new format - I want to implement archive
    updates where previously written data stays unmodified.
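
    A minimal sketch of the anchored-hashing idea (my illustration, not the actual planned format): cut blocks where the low bits of a content-derived hash hit a fixed pattern, so block boundaries re-synchronize after local edits instead of shifting every block that follows:
    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Content-defined block boundaries: cut where the low bits of a
     * content-derived hash are zero; average block size ~ ANCHOR_MASK+1.
     * A production chunker would use a fixed-width rolling hash (like the
     * windowed one sketched above) so the decision depends only on the
     * last few dozen bytes; here the hash simply runs from the block
     * start, which is enough to show the idea. */
    #define ANCHOR_MASK 0x1FFFu   /* ~8 KB average block, assumption */
    #define MIN_BLOCK   2048
    #define MAX_BLOCK   65536

    /* Returns the length of the next block starting at p (len bytes left). */
    size_t next_anchor(const uint8_t *p, size_t len)
    {
        uint32_t h = 0;
        size_t i;
        for (i = 0; i < len && i < MAX_BLOCK; i++) {
            h = h * 0x9E3779B1u + p[i];
            if (i + 1 >= MIN_BLOCK && (h & ANCHOR_MASK) == 0)
                return i + 1;
        }
        return i;   /* end of data or MAX_BLOCK reached */
    }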

  29. #29
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    > How many deflated files use artificially reduced window size?

    I guess the main reason to do that is a little different.
    There's a funny heuristic in zlib that discards 3-byte matches
    if the distance is >4k, while pkzip (aka securezip) further limits it to 256.
    Mhm. You mean we can analyse the stream, find that "oh, there are no 3-byte matches when the distance is >256", and thus guess that the source was pkzip. Might be good, but I'll fully optimize what I have first. Deeply analysing different deflate implementations to find such tricks is quite a lot of work, and while I see that it might be the most effective method, I don't think I want to do it.

    Quote Originally Posted by Shelwien View Post
    > But then you need another way to produce a patch and overall it's
    > extremely complicated and hardly high-performance way of solving an
    > easy problem.

    Compared to xdelta or bsdiff, it's a very simple and fast way to patch big files.
    Replacing the hash with a stronger one in your code is a faster and much simpler solution than adding ECC and some fallback mechanism.

    Quote Originally Posted by Shelwien View Post
    Also, what do you call an "easy problem" here?
    Referencing a possible match 20GB back in a stream?
    I still think that adding some ECC data to the compressed stream would be a good
    workaround for possible collisions, instead of having to write a temp file
    and read from it to check uncached matches (like Bulat does).
    The solution I have proposed from the start is to use a hash with a practically zero probability of producing a collision.

    > Seriously, check some hash function comparisons, I'm yet to meet a
    > fast one that would have collision resistance close to theoretical limit.

    Why should I care about somebody's comparisons when I can test them myself
    in my specific algorithms?
    Anyway, atm I only know two kinds of rolling hashes - LCG and crc,
    and among these, crc looks better to me.
    (...)
    > I've seen several comparisons of fast hash functions and none was
    > collision rate even close to theoretical limit

    Well, in Bloom's post which you (probably) mentioned, crc32 looks good:
    http://cbloomrants.blogspot.com/2010...0-adler32.html
    Yes, it's this one.

    Something's wrong with my memory; I talk nonsense far too frequently. I guess I've been working too much for too long. Sorry for being misleading.

    Quote Originally Posted by Shelwien View Post
    > Initially when I was reading about ZFS dedupe [...]

    Actually, in the case of detecting duplicate clusters (of size 4 KB or more),
    I'd probably use a stronger hash myself, because the data-to-hash size ratio
    allows it (though then again, I'd bet that ZFS can't detect unaligned
    duplicate blocks, if it works like that).

    As to hardware errors, though, I don't exactly agree, because it's much
    more likely to get a hardware error in a SHA-1-based dedup than in one based on crc32,
    just because computing SHA-1 puts significantly more load on the CPU.
    Yes, but you're trading one kind of error for another, and I think everybody agrees that the error rates of SHA* on modern computers are low enough for any use (at least as long as you use the right computer for the right job), so you're taking a non-problem and saying you can do better. If some algorithmic error rate is significantly lower than the hardware error rate, then it's perfectly OK, and nobody will care that yours is lower still - but you haven't made a careful evaluation of whether it really is that low.
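
    To put rough numbers on it (both figures below are order-of-magnitude assumptions, not measurements): treating SHA-256 as a random 256-bit function, deduplicating n = 2^32 blocks of 4 KB (16 TB) gives a birthday-bound collision probability of about
    Code:
    P \approx \frac{(2^{32})^2}{2^{257}} = 2^{-193} \approx 10^{-58}
    while per-bit hardware error rates quoted for storage and memory are somewhere in the 10^-15 to 10^-21 range, which is where the "tens of orders of magnitude" gap comes from.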


    Quote Originally Posted by Shelwien View Post
    And what's the "theoretical limit of collision resistance"?
    I'm pretty sure that if I enumerated all files of length N,
    exactly 256^N/2^32 of them would map to each crc32 value, and
    you can't do better with a 32-bit hash.
    Like Bulat pointed out, in the real world the distribution of files of length N is not uniform, so things should be tested on real-world data. For me the theoretical limit is the performance of a random function, which Bloom's tests estimate well.

    Quote Originally Posted by Bulat Ziganshin View Post
    m^2, I recommend you start by reading http://samba.org/~tridge/phd_thesis.pdf - both a fast rolling hash and SHA-1 are required for the deduplication algorithm. But I believe that any reasonable algorithm, including FCG, CRC and SHA, would be pretty close in collision detection - the more important factor is the number of bits in the hash. srep computes a 64-bit FCG hash, which should be enough to avoid most collisions, but I've seen some (10-100) collisions on a multigigabyte file, and AFAIR it was a problem with handling zeroes. Nevertheless, I use SHA-1 because it's fast enough, is computed in a separate thread, and 100% ensures data integrity. But if you have a fallback solution (i.e. a full download in case the deduplicated data fails a CRC check), a simple 32-bit hash would be enough.
    I will look into it later, thx.

  30. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Replacing hash with a stronger one in your code is faster and much
    > simpler solution than adding ECC and some fallback mechanism.

    I'd say not faster (we only need some block xors to compute the recovery data)
    and, counting by source lines, not simpler either.
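
    For example, a minimal sketch of that kind of recovery data, assuming a fixed block size and one XOR parity block per group (a real scheme such as Reed-Solomon could repair more than one bad block per group):
    Code:
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define BLK   4096   /* block size, assumption */
    #define GROUP 64     /* blocks per parity group, assumption */

    /* parity[] = XOR of all blocks in the group; stored with the archive. */
    void make_parity(const uint8_t blocks[GROUP][BLK], uint8_t parity[BLK])
    {
        memset(parity, 0, BLK);
        for (size_t b = 0; b < GROUP; b++)
            for (size_t i = 0; i < BLK; i++)
                parity[i] ^= blocks[b][i];
    }

    /* If exactly one block of the group is known to be bad (e.g. a dedup
       reference that turned out to be a hash collision), it can be rebuilt
       by XORing the parity with the remaining good blocks. */
    void rebuild_block(const uint8_t blocks[GROUP][BLK], size_t bad,
                       const uint8_t parity[BLK], uint8_t out[BLK])
    {
        memcpy(out, parity, BLK);
        for (size_t b = 0; b < GROUP; b++)
            if (b != bad)
                for (size_t i = 0; i < BLK; i++)
                    out[i] ^= blocks[b][i];
    }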

    Also, for an engine that's supposed to process petabytes of data, I'd say
    it's necessary to have a collision workaround anyway.

    http://www.schneier.com/blog/archive...a1_broken.html

    Imho there's no sense in being stubborn about it - a faster solution which
    doesn't rely on your trust in hashes is clearly better, even if it's
    more complicated.
