Thread: Strategy for large but similar files

  1. #1
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts

Strategy for large but similar files

    Hello!

I need some help with choosing an effective compression strategy. I have daily plain text files (actually numbers), about 200 MB each. The important thing is that those text files are very similar: each day I also get a file that contains only the differences between the current and the previous file, and it weighs anywhere from 50 to 600 kB. The main files have about 4.000.000 lines each, and only a few thousand lines, at most, get changed daily. I keep telling myself that there must be a way to compress, say, 10 such files much more effectively than I actually do, which is about a 33% compression ratio. Pure logic says there should be a way to stuff 10 such files into no more than 250 MB: one full 200 MB file plus nine diff files of at most 600 kB is only about 205 MB even before any compression. I have tried various programs and various methods, but I always end up with similar results.

Edit: This is what the lines actually look like. I only ROTed the numbers and letters for privacy reasons.
    655555578:55752:63:1620174:34617551900:4766063:
    655555597:55667:63:78774802:08868799942:2197771:
    655555502:55667:63:75090768:89310776194:3869471:
    655555516:A:55656:63:6097186:12713294739:2197796:
    655555521:A:55656:63:7576956:61922125732:2270110:
    655555535:F:55660:63:78643475:19341759784::
    655555540:A:55656:63:67293649:56593295000::
    655555653:A:55656:655:77484044:46062352350:6799564 5:
    655555667:A:55656:63:75009331:38757290270::
    655555691:55667:63:60232725:14848969760:2278624:
    655555610:F:55660:63:78655625:73840331325::
    655555639:A:55656:63:78060090:07317468112:68837175 :
    655555644:H:55969:63:65697232:08296983285:4416914:
    655555756:J:55614:63:77486040:82966772276::
    655555761:H:55962:94:66153985:35237485700:4416912:
    655555775:A:55656:27:78925130:34697767202::
    655555709:A:55616:63:78081091:00166837164::
    655555714:H:55965:63:77831054:96938394837:65526954 :
    655555733:55667:63:2650966:48816497398:4652662:
    655555747:F:55660:63:78672710:39138972773:2444481:
    655555884:F:55660:63:78679322:67099886035::
    655555817:B:55688:63:61586969:77564948142:2225089:
    655555822:A:55656:63:78065549:80661911001:2111383:
    655555836:A:55685:63:78696477:80405031951:65761103 :
    655555841:A:55656:63:69102225:67386830987:65612148 :
    655555987:55667:669:98306:80001179669::
    655555906:55667:63:69331923:88897986579::
    655555911:A:55656:63:1139462:32727893217::
    655555925:A:55656:63:61175022:81343687341:2476484:
    655555057:A:55656:63:1300239:09524011122:4454442:
    655555062:A:55656:63:274463:85259129388:4465552:
    655555081:55667:669:992503:32960673978:4260697:
    655555029:55667:63:3303981:98504362108:3205139:
    655555034:A:55656:63:414194:15712656590:68875961:
    Last edited by taurus; 16th April 2014 at 12:59.

  2. #2
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
Did you try shar + srep?
7-zip with a dictionary size of 256 MB or more?

    Are you aiming for fast backup or fast restore?
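
For reference, a rough 7-Zip command line for that idea (untested here; the flags come from the 7-Zip docs, and the archive/file names are just placeholders): -m0=lzma selects LZMA, -md=256m asks for a 256 MB dictionary, and -ms=on makes the archive solid.

Code:
7z a -t7z -m0=lzma -md=256m -ms=on daily.7z day1.txt day2.txt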

  3. #3
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts
    I haven't documented previous attempts. Here is my latest try based on your suggestion:
    7zip 9.20
I only compressed two files, about 4.300.000 lines each; only about 14.000 lines are different.
compression level: normal
    method: PPMd (LZMA max dic size is 64MB)
    dic: 256MB
    word size: 6
    solid block size: solid

    The result is just slightly better than average, about 29% compression.
I would expect such a ratio if I only compressed a single file, but with the addition of the second and every subsequent file the ratio should drop drastically.
I have not tried shar+srep; I guess that appends the files to each other in some way?
Fast backup is more important: I plan to add files to the archive daily and only decompress them occasionally, as needed. I was thinking of organising the files into monthly archives, to keep the size down and more manageable.

  4. #4
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
Yeah, shar gathers the files together into one file, since srep only works on a single file at a time.
7-zip and srep, I believe, are aimed more at fast restore, but that said, srep is pretty fast.


Here is the issue I'm seeing from your feedback: you are running 7-zip on a 32-bit OS, which limits the usable dictionary size with LZMA.

The real experts here can explain this better, but I'll try my best, to my understanding:
LZ-based compression looks for repeated patterns within a range given by the dictionary size.
What you are aiming for is finding repeats across two files that are nearly identical but around 250 MB each.
That means the repeated start of file 1 and file 2 is 250 MB apart. With only a 64 MB window the compressor is not able to see the identical parts of both files, as they are simply too far away.

You need to get that dictionary size up to 256 MB or more to achieve the results you are looking for (finding the identical parts of different files).
Also make sure you are doing solid compression, otherwise the search for repeated patterns ends and resets between files.

7-zip's LZMA needs around 2.7 GB of RAM to work with a 256 MB dictionary at default settings (which is too much for your 32-bit OS).
Srep needs a lot less, as it is optimized for big dictionary sizes with small memory usage, and it might be your best bet with your 32-bit OS.

Another way is to split the files and compress corresponding blocks together: instead of going "file1/250mb then file2/250mb" you go "file1part1/32mb then file2part1/32mb then file1part2/32mb then file2part2/32mb". That way the identical parts would be within 7-zip's 64 MB dictionary. However, if the original files differ in size, there is going to be a skew going down the parts, and that will reduce the method's effectiveness.
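
A rough sketch of that splitting idea, assuming a Unix-style shell (e.g. Cygwin/MSYS on Windows) with split, tar and the 7z command line available; all names are placeholders and the part list is abbreviated:

Code:
# cut each daily file into 32 MB pieces (file1.part00, file1.part01, ...)
split -b 32m -d file1 file1.part
split -b 32m -d file2 file2.part
# tar keeps the order given on the command line, so matching 32 MB pieces of
# both days sit next to each other, well inside a 64 MB LZMA dictionary
tar cf interleaved.tar file1.part00 file2.part00 file1.part01 file2.part01 file1.part02 file2.part02
# then compress the tar with plain LZMA and the 64 MB dictionary
7z a -t7z -m0=lzma -md=64m interleaved.7z interleaved.tar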

TBH I would just try out shar+srep, as it's probably the easiest way for you, will get you the compression effect you are after, and it's pretty fast.
    Last edited by SvenBent; 16th April 2014 at 15:50.

  5. #5
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
A helping start at the command line:

    shar a - <file1> <file2> | srep - Test.shar.srep

    Also you can see one of the miracles srep did for my compression here
    http://encode.ru/threads/583-SRep-ma...chive-into-3gb
    Last edited by SvenBent; 16th April 2014 at 16:02.

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Similar files have to be compressed together, like tar | gzip, or if using an archiver, compress in solid mode.

    If you use zpaq, it will deduplicate common data regions larger than about 64K even if you add them separately. You can use the -fragment option to reduce the fragment size like -fragment 0 (1 KB) or -fragment 2 (4 KB) to get better deduplication, but remember to use the same option every time you update the archive.
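
For example, something along these lines (made-up file names; the key point is to pass the same -fragment value on every update):

zpaq a archive.zpaq day1.txt -fragment 2
zpaq a archive.zpaq day2.txt -fragment 2
zpaq l archive.zpaq -summary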

  7. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Similar files have to be compressed together, like tar | gzip, or if using an archiver, compress in solid mode.
    The problem seems to be that the matches are so far apart that the sliding window doesn't catch them.

    Bulat's SREP seems to have been designed for exactly this scenario.

The Wikipedia page implies that LZ with forward-pointing references inherently requires two passes, but wouldn't that be about the same as LZ done backwards?

    Quote from freearc.org:

Future-LZ is a modification of LZ77 that stores matches at the match source rather than at the destination position. For a compressor like SREP that utilizes only long matches, this allows decreasing the amount of data stored in the decompressor, and therefore the amount of memory required for decompression. In my tests, decompression required RAM equal to about 10% of the file size. Moreover, since we know the order of access to these stored data, they may be swapped from RAM to a disk file without losing efficiency, so decompression may be performed using just about 100 MB of RAM.
    Last edited by nburns; 17th April 2014 at 05:17.

  8. #8
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    At risk of trivializing this problem: what do you get back if you take two of these files and diff them? If diff works and doesn't choke, you can reduce your data to mostly patch files, and you can compress those.
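
For instance, with GNU diff and patch (so on Windows this assumes something like Cygwin, MSYS or GnuWin32; file names are placeholders):

Code:
# keep day1.txt in full and store only the differences for day2
diff day1.txt day2.txt > day2.diff
# later, rebuild day2.txt from day1.txt plus the stored diff
patch -o day2.txt day1.txt day2.diff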

  9. #9
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by SvenBent View Post
A helping start at the command line:

    shar a - <file1> <file2> | srep - Test.shar.srep

    Also you can see one of the miracles srep did for my compression here
    http://encode.ru/threads/583-SRep-ma...chive-into-3gb
    Good morning.

What is the "a - " supposed to do? Shar does not recognise that syntax and looks for a file named accordingly. The argument -a wants its partner -n, and I don't seem to need either.
I get interesting results. If I only process a single file, I get a test.shar.srep that is the same size as the input file. I also get some strange dependency errors midway through the process:
    shar 15 16 | srep - Test.shar.srep
    SREP 3.2 (April 6, 2013): input size 25600 mb, memory used 1744 mb, -m1f -l512 -
    c512 -a4
    [...]
    0%: 184,549,376 -> 184,550,008: 100.00%. Cpu 28 mb/s (6.193 sec), real 20 mb/s
    0%: 192,937,984 -> 192,938,644: 100.00%. Cpu 29 mb/s (6.443 sec), real 20 mb/s
    0%: 201,326,592 -> 201,327,280: 100.00%. Cpu 29 mb/s (6.646 sec), real 21 mb/s
    (9.257 sec) = 72%The system cannot find the file specified.
    0%: 207,399,560 -> 207,400,276: 100.00%. Cpu 29 mb/s (6.848 sec), real 13 mb/s
    (15.622 sec) = 44%
    Second pass: 100%

I am tempted to jump ahead and conclude that it doesn't even try to compress a single file?

If I process two files, I get the same error, but also a file that is more or less (original file size) + (size of the differences × 10), which is, by itself, almost perfect.
However, I am worried about the error message and also need to do more testing, especially with real-life scenarios. In that regard, the lack of alternative software options, even just for viewing the archive contents or seeking out a certain file, is a bit of a problem: I may move on to other things later and need something that is easier to handle for the non-IT people who may take this over.
    That said, so far the results are amazingly fast and also very efficient.
    Last edited by taurus; 17th April 2014 at 09:44.

  10. #10
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Similar files have to be compressed together, like tar | gzip, or if using an archiver, compress in solid mode.

    If you use zpaq, it will deduplicate common data regions larger than about 64K even if you add them separately. You can use the -fragment option to reduce the fragment size like -fragment 0 (1 KB) or -fragment 2 (4 KB) to get better deduplication, but remember to use the same option every time you update the archive.
My attempt with zpaq probably needs more fine-tuning. I used a simple zpaq a aaa.zpaq 15 16

    C:\Winapp\winzpaq>zpaq l aaa.zpaq 15 16
    Using 2 threads
    Reading archive aaa.zpaq
    1 2014-04-16 09:08:04 .A.... 207396610 15
    1 2014-04-16 09:08:14 .A.... 207398221 16
    2 files. 414794831 -> 176136997

    Version Date Time (UT) +Files -Deleted Original MB Compressed MB
    ---- ---------- -------- -------- -------- --------------- ---------------
    0 0 0 0.000000 0.000000
    1 2014-04-17 07:13:40 2 0 414.794831 176.136997
    0.02 seconds

  11. #11
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by nburns View Post
    At risk of trivializing this problem: what do you get back if you take two of these files and diff them? If diff works and doesn't choke, you can reduce your data to mostly patch files, and you can compress those.
I gave fc a try. It's fast, and under other circumstances I would maybe go that way. For the solution to be broad-user-base-proof, though, I was looking for something more friendly to the average Windows user, *ideally* something like a plugin for Total Commander, where anyone could enter the archive as a directory, then quickly hit F3 twice to get the search dialogue. If I tell them to first check the reference file, then decompress the diffs, then seek inside the diff file, they will probably just give up.
It seems that effectively archiving these files is not, as I was hoping, a matter of simply choosing the proper compression parameters. As I am responsible for designing, rather than using, the workflow, I think I will just leave them uncompressed: a two-year archive span requires less than 200 GB of disk space. At least for now!

  12. #12
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Taurus,

    would be possible to obtain ~2-5 real files to test?

    Sincerely yours,

    FatBit

  13. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by nburns View Post
    The wikipedia page makes the implication that LZ with forward-pointing references inherently requires two passes, but wouldn't that be about the same as LZ done backwards?
Future-LZ stores the LZ match info (offset + length) at the match source position, and that can be done with backward LZ processing. But LZ also needs to eliminate the repeated data at the destination position, and that cannot be done with backward processing.

Nevertheless, srep 3.9 implemented a new mode that does only one pass over the file. It just keeps the full list of matches in memory and, after processing the file, sorts the matches in match_src order and stores them at the end of the compressed file.

Also, the -l16 option ensures minimum size of the compressed file if you don't compress it further; -l512 is optimized for subsequent LZMA compression.
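
So, roughly (placeholder file names, using the same input/output argument order as elsewhere in this thread):

Code:
# if the .srep file will be compressed further with LZMA afterwards:
srep -l512 daily.tar daily.tar.srep
# if the .srep file is the final output:
srep -l16 daily.tar daily.tar.srep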

  14. #14
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by taurus View Post
I gave fc a try. It's fast, and under other circumstances I would maybe go that way. For the solution to be broad-user-base-proof, though, I was looking for something more friendly to the average Windows user, *ideally* something like a plugin for Total Commander, where anyone could enter the archive as a directory, then quickly hit F3 twice to get the search dialogue. If I tell them to first check the reference file, then decompress the diffs, then seek inside the diff file, they will probably just give up.
It doesn't look like fc can be used to generate patch files, which was what I was envisioning. Unix diff has a counterpart called patch which can be used to reproduce original files from their diffs.

It seems that effectively archiving these files is not, as I was hoping, a matter of simply choosing the proper compression parameters. As I am responsible for designing, rather than using, the workflow, I think I will just leave them uncompressed: a two-year archive span requires less than 200 GB of disk space. At least for now!
    That sounds perfectly reasonable.

  15. #15
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by FatBit View Post
    Dear Mr. Taurus,

    would be possible to obtain ~2-5 real files to test?

    Sincerely yours,

    FatBit
Dear FatBit, I really appreciate your offer and am tempted to oblige; however, this is sensitive data. Maybe if I knew how to quickly ROT(n) (where n is a variable, hence very basic encryption) 1 GB of data, or even better, randomize the numbers (in a way which would keep matches between files, which by definition contradicts randomization)? IDK.
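
Purely as an illustration of the ROT(n) idea (only light obfuscation, not real protection, as pointed out further down): a fixed digit-to-digit substitution keeps all matches between files intact and runs quickly even on 1 GB, assuming a Unix-style shell with tr available:

Code:
# every digit is replaced by another fixed digit, so identical lines stay identical
tr '0123456789' '3947160528' < real.txt > scrambled.txt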
    Last edited by taurus; 17th April 2014 at 13:03.

  16. #16
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    You can try http://libbsc.com/ with max. block size (after tar)?

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
What do you get when you compress the files separately? What about using a smaller fragment size?

    zpaq a archive 15
    zpaq a archive 16

    zpaq a archive 15 -fragment 0
    zpaq a archive 16 -fragment 0

    -fragment 0 means use an average fragment size of 2^0 KB for deduplication. Default is 6 (64 KB).

    You can get a summary of deduplication statistics with
    zpaq l archive -summary

    You may have an older version that does not support the -fragment option. Latest stable version is at http://mattmahoney.net/dc/zpaq.html

  18. #18
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
Sensitive… understood. I will "support" you a different way. If I understand correctly, you have:

    Day 1 MainFile 1 (~200 MB, ~4 M lines)
    Day 2 MainFile 2 (~200 MB, ~4 M lines)
    DifFile1-2 (60-600 kB)
    Day 3 MainFile 3 (~200 MB, ~4 M lines)
    DifFile2-3 (60-600 kB)
    Day 4 MainFile 4 (~200 MB, ~4 M lines)

etc. Is it true that if you have MainFile 1 and all the DifFiles, you are able to reconstruct all the MainFiles? Or is it possible to sort the content of the files before compression?
I have some programs for patching on my install disk. If you wish, I will send them to you.
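
If the DifFiles happened to be in a format that a patch tool understands (which may well not be the case here), the reconstruction chain would just be something like this, sketched with GNU patch and the placeholder names above:

Code:
# each day is rebuilt from the previous day plus that day's diff
patch -o MainFile2 MainFile1 DifFile1-2
patch -o MainFile3 MainFile2 DifFile2-3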

    Sincerely yours,

    FatBit

  19. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by taurus View Post
Dear FatBit, I really appreciate your offer and am tempted to oblige; however, this is sensitive data. Maybe if I knew how to quickly ROT(n) (where n is a variable, hence very basic encryption)
    Probably not a good idea to create some kind of known-flawed encryption scheme. That might just make it an entertaining puzzle.

1 GB of data, or even better, randomize the numbers (in a way which would keep matches between files, which by definition contradicts randomization)? IDK.
    What you'd likely want is some kind of hash, possibly a cryptographic one. That way you could recognize matches without being able to reconstruct the original numbers. There's always risk involved, though, especially when designing your own scheme.

  20. #20
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts
    I did some testing on a 64bit system and here are the results:

    7zip, LZMA, 256MB dictionary, word size 32, solid archive:
    4 × 197,7 MB source files, compressed to (drum roll) 60,6 MB!!! 7,6% ratio!!!
It did take almost 15 minutes on an FX-8350 though (using a max. of 2/8 threads), and 2,7 GB of RAM!

Interestingly, LZMA2 fared much worse, with a standard 30% ratio, but finished the job in just 2 minutes.

    Thank you for all your suggestions so far, I will post back, and please don't mind my drastic lack of time :/

Also, LZMA compressed a single file to 60,3 MB (30%), which means that it managed to stuff the differences of the remaining 3 files into just 300 kB.
I think it is safe to say that this is as good as perfect!
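
For anyone scripting this later, the rough command-line equivalent of those GUI settings should be something like this (untested; flag names are from the 7-Zip docs, archive and file names are placeholders):

Code:
# 256 MB LZMA dictionary, word size (fast bytes) 32, solid archive, 2 threads
7z a -t7z -m0=lzma -md=256m -mfb=32 -ms=on -mmt=2 daily.7z day1.txt day2.txt day3.txt day4.txt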
    Last edited by taurus; 18th April 2014 at 22:26.

  21. #21
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    372
    Thanks
    26
    Thanked 22 Times in 15 Posts
If you select LZMA2 with only 2 threads, like in LZMA, then the compression should be the same, but maybe a little bit faster. Did you try zpaq? It should also do a good job.

  22. #22
    Member
    Join Date
    Apr 2014
    Location
    Croatia
    Posts
    8
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by thometal View Post
If you select LZMA2 with only 2 threads, like in LZMA, then the compression should be the same, but maybe a little bit faster. Did you try zpaq? It should also do a good job.
    Yes, after reducing to 2 threads, I get a file that is 16kB larger than the LZMA version. It took approx. 15 minutes also. I must say that I did not expect such close results!

    I think I have a winner in the GUI category. Now onto the CLI....

  23. #23
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
    Quote Originally Posted by taurus View Post
    I did some testing on a 64bit system and here are the results:

    7zip, LZMA, 256MB dictionary, word size 32, solid archive:
    4 × 197,7 MB source files, compressed to (drum roll) 60,6 MB!!! 7,6% ratio!!!
It did take almost 15 minutes on an FX-8350 though (using a max. of 2/8 threads), and 2,7 GB of RAM!

Interestingly, LZMA2 fared much worse, with a standard 30% ratio, but finished the job in just 2 minutes.

    Thank you for all your suggestions so far, I will post back, and please don't mind my drastic lack of time :/

Also, LZMA compressed a single file to 60,3 MB (30%), which means that it managed to stuff the differences of the remaining 3 files into just 300 kB.
I think it is safe to say that this is as good as perfect!

I told ya so.


Nevertheless, in regards to the shar+srep command line:

    shar a - <file1> <file2> | srep - Test.shar.srep

Code:
shar               - starts the shar program
a                  - tells the program it needs to add the files into a single file
-                  - destination file; the - tells it to use stdout, which delivers the data to the next program in the chain instead of writing it to a file
<file1> <file2>    - the source files it needs to put together into one file
|                  - tells the shell that we are chaining programs and piping the output into the next one
srep               - starts srep
-                  - input file (yes, the order is reversed compared to shar); the - tells it to use stdin (which is chained to stdout from shar)
test.shar.srep     - the destination file

I just used this without any issues, so you might wanna recheck your shar+srep test:

    shar a - *.utx | srep64i - test.shar.srep

It worked flawlessly. However, the output in the console box will look very cluttered, because both shar and srep write to the screen at the same time.

Alternatively, you can just do the two commands step by step, but it will be slower since you are now making a temporary file on the HDD:

    shar a test.shar *.utx
    srep test.shar test.shar.srep
    del test.shar

    good luck
    Last edited by SvenBent; 19th April 2014 at 03:28.
