
Thread: Binary interleave for known-similar files greater than sliding window size?

  1. #1
    Member
    Join Date
    Sep 2015
    Location
    Internet
    Posts
    2
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Question Binary interleave for known-similar files greater than sliding window size?

    Hi!

    I have several files that are quite similar, with some minor differences in places.

    I want to compress them all together, but the size of each individual file is larger than I can make 7-Zip's or gzip's sliding window.

    So I thought: what if the binary files were interleaved (1 MB of A, 1 MB of B, ...), so that the sliding window could better find the similar chunks?

    I began to write an interleave tool, but for the sake of future decompression it would be best if there were a pre-existing format that handled this. The interleaving would be done in isolation (with zero compression) and the result compressed separately by a later tool.
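
    Something along these lines, as a minimal sketch (fixed-size blocks plus an ad-hoc header recording the block size and original file sizes so the operation can be undone later; the header layout and names are only for illustration):

    import os
    import struct
    import sys

    BLOCK = 1 << 20  # 1 MB blocks, as in the example above

    def interleave(out_path, paths, block=BLOCK):
        # Round-robin fixed-size blocks of every input into one stream, preceded
        # by a tiny ad-hoc header (block size, file count, original sizes) so the
        # operation can be undone later.
        sizes = [os.path.getsize(p) for p in paths]
        with open(out_path, "wb") as out:
            out.write(struct.pack("<II", block, len(paths)))
            for s in sizes:
                out.write(struct.pack("<Q", s))
            handles = [open(p, "rb") for p in paths]
            try:
                while True:
                    chunks = [h.read(block) for h in handles]
                    if not any(chunks):
                        break
                    for c in chunks:   # near EOF a chunk may be short or empty;
                        out.write(c)   # the stored sizes let us re-split exactly
            finally:
                for h in handles:
                    h.close()

    def deinterleave(in_path, out_paths):
        # Reverse interleave(): re-split the stream using the stored sizes.
        with open(in_path, "rb") as f:
            block, n = struct.unpack("<II", f.read(8))
            remaining = [struct.unpack("<Q", f.read(8))[0] for _ in range(n)]
            outs = [open(p, "wb") for p in out_paths]
            try:
                while any(remaining):
                    for i in range(n):
                        want = min(block, remaining[i])
                        if want:
                            outs[i].write(f.read(want))
                            remaining[i] -= want
            finally:
                for o in outs:
                    o.close()

    if __name__ == "__main__":
        # e.g.  python interleave.py combined.bin A.bin B.bin C.bin
        interleave(sys.argv[1], sys.argv[2:])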

    Is it possible to implement this kind of split-file binary interleave in e.g. the RAR, ACE, tar, or 7Zip bytestream formats?

    Does anyone have any suggestions about packing multiple large similar files together?

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    You need deduplication, e.g. zpaq

  3. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    mappu (19th September 2015)

  4. #3
    Member
    Join Date
    Sep 2015
    Location
    Internet
    Posts
    2
    Thanks
    1
    Thanked 0 Times in 0 Posts
    I tried an experiment with only three of my target files:


    A.zip 258MB
    B.zip 258MB
    C.zip 258MB


    7z_ultra(precomp(A.zip) + precomp(B.zip) + precomp(C.zip)) = 572MB


    7z_ultra(precomp(A.zip) + xdelta(precomp(A.zip), precomp(B.zip)) + xdelta(precomp(B.zip), precomp(C.zip))) = 223MB


    srep_best(tar(precomp(A.zip) + precomp(B.zip) + precomp(C.zip))) = 503MB


    zpaq(tar(precomp(A.zip) + precomp(B.zip) + precomp(C.zip))) = 339MB


    zpaq_method6(tar(precomp(A.zip) + precomp(B.zip) + precomp(C.zip))) = 232MB



    The zpaq approach produced the best file size, but I think 7z+xdelta was faster, although I didn't record exact timings.
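
    For reference, the delta-chain step can be scripted roughly like this (a minimal sketch only: the precomp step is omitted, and the xdelta3/7z options shown are illustrative rather than an exact record of the run above):

    import subprocess
    import sys

    def delta_chain_then_7z(inputs, archive):
        # Keep the first file as-is; store every later file as an xdelta3 patch
        # against its predecessor; then pack the whole chain into one 7z archive.
        # Assumes xdelta3 and 7z are on PATH.
        members = [inputs[0]]
        for prev, cur in zip(inputs, inputs[1:]):
            patch = cur + ".vcdiff"
            # xdelta3 -e -s SOURCE TARGET PATCH : encode TARGET as a delta of SOURCE
            subprocess.run(["xdelta3", "-e", "-f", "-s", prev, cur, patch], check=True)
            members.append(patch)
        # -mx=9 is 7-Zip's "ultra" compression preset
        subprocess.run(["7z", "a", "-mx=9", archive] + members, check=True)

    if __name__ == "__main__":
        # e.g.  python chain.py A.pcf B.pcf C.pcf   (names here just assume precomp output)
        delta_chain_then_7z(sys.argv[1:], "chained.7z")

    Decompression runs the chain in the other direction: extract the archive, then xdelta3 -d -s A.pcf B.pcf.vcdiff B.pcf, and use the reconstructed B.pcf as the source for C, and so on.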


    So, thank you for the suggestion; I will use zpaq for my project.

    ..

    I'm still interested in whether it's possible for deflate-style compressors to reach this lower file size through some kind of interleaving technique. I think it should be possible, either by using a large enough dictionary or by massaging the input data so that a large dictionary is not necessary. But even SREP -m3f -a1 was unable to identify the redundancy here.

  5. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    1. Deduplication is pretty much like LZ compression with a full-file dictionary, but restricted to long matches. E.g. the best srep mode should find almost any match longer than 512 bytes (the -l option).

    2. srep should remove duplicates, so srep(precomp(file)*N) == srep(precomp(file)). zpaq performs deduplication + compression, so its result is better. For dedup only, you can try zpaq method 0 - its output should be larger than srep's. I suggested zpaq just because it's simpler to use.

    3. Of course you can do interleaving, as long as you know how two (or more) files correspond. A dedup engine is something more general - i.e. it finds duplicates even when there are multiple repetitions or the data chunks have been mixed up.
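
    To illustrate point 3, a toy content-defined-chunking deduplicator (this is only a minimal sketch of the "full-file dictionary, long matches only" idea - not how srep or zpaq are actually implemented):

    import hashlib
    import os

    def chunks_cdc(data, mask=(1 << 13) - 1, min_len=2048, max_len=65536):
        # Content-defined chunking: a boundary is declared wherever a simple
        # rolling hash of the most recent bytes hits a fixed bit pattern, so
        # chunk boundaries re-align even after data is inserted or moved around.
        out, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + b) & 0xFFFFFFFF
            length = i - start + 1
            if (length >= min_len and (h & mask) == mask) or length >= max_len:
                out.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            out.append(data[start:])
        return out

    def dedup(data):
        # Toy deduplicator: keep each distinct chunk once plus a reference list.
        store, refs = {}, []
        for c in chunks_cdc(data):
            key = hashlib.sha256(c).digest()
            store.setdefault(key, c)
            refs.append(key)
        return store, refs

    def restore(store, refs):
        return b"".join(store[k] for k in refs)

    if __name__ == "__main__":
        blob = os.urandom(256 * 1024)
        data = blob + b"...a small difference..." + blob   # same data, shifted
        store, refs = dedup(data)
        assert restore(store, refs) == data
        print(len(data), "bytes ->", sum(map(len, store.values())), "bytes of unique chunks")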

  6. #5
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by mappu View Post
    I tried an experiment with only three of my target files:


    A.zip 258MB
    B.zip 258MB
    C.zip 258MB


    7z_ultra(precomp(A.zip) + precomp(B.zip) + precomp(C.zip)) = 572MB


    7z_ultra(precomp(A.zip) + xdelta(precomp(A.zip), precomp(B.zip)) + xdelta(precomp(B.zip), precomp(C.zip))) = 223MB


    srep_best(tar(precomp(A.zip) + precomp(B.zip) + precomp(C.zip))) = 503MB


    zpaq(tar(precomp(A.zip) + precomp(B.zip) + precomp(C.zip))) = 339MB


    zpaq_method6(tar(precomp(A.zip) + precomp(B.zip) + precomp(C.zip))) = 232MB



    The zpaq approach produced the best file size, but I think 7z+xdelta was faster, although I didn't record exact timings.
    Didn't 7z+xdelta actually produce the best file size? #2 looks like it's the smallest to me.

    So, thank you for the suggestion; I will use zpaq for my project.

    ..

    I'm still interested in whether it's possible for deflate-style compressors to reach this lower file size through some kind of interleaving technique. I think it should be possible, either by using a large enough dictionary or by massaging the input data so that a large dictionary is not necessary. But even SREP -m3f -a1 was unable to identify the redundancy here.
    You could interleave different files to help LZ, as long as you did it carefully. I think the right kind of interleaving would depend heavily on the particular dataset, so it would be hard to generalize and make automatic. (I realized that Bulat said essentially the same thing in his #3).

  7. #6
    Member
    Join Date
    May 2013
    Location
    ARGENTINA
    Posts
    54
    Thanks
    62
    Thanked 13 Times in 10 Posts
    Hi mappu. I'm curious about the steps to compress with 7z + xdelta. I know xdelta is for making patches, but I can't understand how this benefits the compression ratio. Can you tell me how to do it? Thank you.

  8. #7
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    [deleted]
    Last edited by Lucas; 23rd September 2015 at 01:27. Reason: redundant

  9. #8
    Member
    Join Date
    Jan 2013
    Location
    USA
    Posts
    28
    Thanks
    0
    Thanked 2 Times in 2 Posts
    What about interleaving the files on a byte-by-byte or bit-by-bit basis, if they really are similar?
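
    For what it's worth, the byte-by-byte variant is tiny to sketch (assuming equal-length inputs for simplicity; a real tool would also need to record the original lengths):

    def interleave_bytes(datas):
        # Round-robin one byte at a time: A[0] B[0] C[0] A[1] B[1] C[1] ...
        n = len(datas[0])
        assert all(len(d) == n for d in datas)
        out = bytearray()
        for i in range(n):
            for d in datas:
                out.append(d[i])
        return bytes(out)

    def deinterleave_bytes(blob, count):
        # Inverse: every count-th byte belongs to the same original stream.
        return [blob[i::count] for i in range(count)]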

Similar Threads

  1. Strategy for large but similar files
    By taurus in forum Data Compression
    Replies: 22
    Last Post: 19th April 2014, 03:26
  2. Haruyasu Yoshizaki's LZHUF 32-bit Compile (Or Similar)?
    By comp1 in forum Data Compression
    Replies: 8
    Last Post: 15th March 2014, 17:45
  3. Binary Size-Optimization Algorithm Idea
    By paqfan in forum Data Compression
    Replies: 11
    Last Post: 19th March 2012, 16:41
  4. which tool has the samllest size/orignal size * time
    By l1t in forum Data Compression
    Replies: 7
    Last Post: 27th August 2010, 06:10
  5. bzip2 dictionary size
    By Wladmir in forum Data Compression
    Replies: 3
    Last Post: 7th April 2010, 16:09
