
Thread: Precomp 0.4.7

  #1 schnaader (Programmer)

    Precomp 0.4.7

    Precomp 0.4.7 is out.

    List of changes:

    • Merge with the excellent preflate library to support recompression of all zLib streams - thanks to Dirk Steinke
    • Support lzma filters (-lf) for improved compression of executables, audio and structured data (see issue #75; a small delta-filter sketch at the end of this post illustrates the idea)
    • Support for zLib streams larger than 2 GB (see issue #65)
    • Changed to CMake for easier builds
    • Fixed crashes when running multiple instances of Precomp in the same directory (see issue #87)
    • Corrected -pdfbmp statistics (see issue #27)
    • Discard insufficient partial bZip2 matches (see commit 138c107) - thanks to Gonzalo Muñoz
    • Several minor bugs and memory leaks fixed (thanks to Gonzalo Muñoz and Dirk Steinke)


    Have a look at https://github.com/schnaader/precomp-cpp

    As the last release was over a year ago, I decided to just release it now. It has been stable in my latest tests, and preflate support is a big improvement for lots of files (especially big archives and PNGs). For example, silesia.zip from Matt's site had 5 files in it which could only be recompressed partially, so the "-cn" result was 88 MB; the new version successfully decompresses all files and the result is 231 MB. Running "precomp silesia.zip" gives an lzma-compressed result of 47,122,779 bytes (not to be compared with official Silesia results because those process the files separately, but it shows that Precomp can operate on the unextracted .zip file).
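
    One of the changes above, lzma filters (-lf), includes a delta filter (the "D" option). The idea behind it: data with a fixed record size, such as PCM audio samples, compresses much better once each byte is replaced by its difference to the byte one record earlier. A minimal sketch of that transform, assuming the delta filter here behaves like the standard lzma/xz delta transform (an illustration, not Precomp's actual filter code):
    Code:
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Forward delta filter: out[i] = in[i] - in[i - dist] (bytes wrap around).
    // With dist = 2, little-endian 16-bit PCM becomes sample-to-sample deltas.
    std::vector<uint8_t> delta_encode(const std::vector<uint8_t>& in, size_t dist) {
        std::vector<uint8_t> out(in.size());
        for (size_t i = 0; i < in.size(); ++i)
            out[i] = uint8_t(in[i] - (i >= dist ? in[i - dist] : 0));
        return out;
    }

    // Inverse filter, applied when restoring: adds the previous restored byte back.
    std::vector<uint8_t> delta_decode(const std::vector<uint8_t>& in, size_t dist) {
        std::vector<uint8_t> out(in.size());
        for (size_t i = 0; i < in.size(); ++i)
            out[i] = uint8_t(in[i] + (i >= dist ? out[i - dist] : 0));
        return out;
    }

    The distance (1..256 per the option help) would be chosen to match the record size, e.g. 2 or 4 for PCM audio.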


  #2 Shelwien (Administrator)
    Code:
    Z:\017>precomp.exe -cn -v -intense book1_gz5.raw
    
    Precomp v0.4.7 Windows 64-bit - ALPHA version - USE FOR TESTING ONLY
    Free for non-commercial use - Copyright 2006-2019 by Christian Schneider
      preflate v0.3.5 support - Copyright 2018 by Dirk Steinke
    
    Input file: book1_gz5.raw
    Output file: book1_gz5.pcf
    
    Using packJPG for JPG recompression, packMP3 for MP3 recompression.
    --> packJPG library v2.5k (01/22/2016) by Matthias Stirner / Se <--
    --> packMP3 library v1.0g (01/22/2016) by Matthias Stirner <--
    More about packJPG and packMP3 here: http://www.matthiasstirner.com
    
    New size: 318255 instead of 318230
    
    Done.
    
    Z:\017>preflate_raw.exe -s book1_gz5.raw
    preflate v0.3.4
    splitting book1_gz5.raw successful (318230 -> 768771 + 13021)
    All ok

    What am I doing wrong?

  #3 schnaader (Programmer)
    Quote Originally Posted by Shelwien:
    What am I doing wrong?
    I guess intense mode (zLib stream, 2-byte header followed by a deflate stream, RFC 1950) is not enough; try -brute (raw deflate stream, 3-bit header, RFC 1951):

    Code:
    C:\temp\shelwien\cdm_test_5>precomp -cn -v -brute book1__kzip.raw
    
    Precomp v0.4.7 Windows 64-bit - ALPHA version - USE FOR TESTING ONLY
    Free for non-commercial use - Copyright 2006-2019 by Christian Schneider
      preflate v0.3.5 support - Copyright 2018 by Dirk Steinke
    
    Input file: book1__kzip.raw
    Output file: book1__kzip.pcf
    
    Using packJPG for JPG recompression, packMP3 for MP3 recompression.
    --> packJPG library v2.5k (01/22/2016) by Matthias Stirner / Se <--
    --> packMP3 library v1.0g (01/22/2016) by Matthias Stirner <--
    More about packJPG and packMP3 here: http://www.matthiasstirner.com
    
    (0.00%) Possible zLib-Stream (brute mode) found at position 0
    Compressed size: 299437
    Can be decompressed to 768771 bytes
    Non-ZLIB reconstruction data size: 27312 bytes
    Recursion start - new recursion depth 1
    No recursion streams found
    Recursion end - back to recursion depth 0
    New size: 796118 instead of 299437
    
    Done.
    Time: 563 millisecond(s)
    
    Recompressed streams: 1/1
    Brute mode streams: 1/1
    
    You can speed up Precomp for THIS FILE with these parameters:
    -d0

    Note that this is a different .raw file of yours that I found on my disk. The "Non-ZLIB reconstruction data size" is the interesting indicator, of course.
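
    For reference, what -intense and -brute actually look for: a zLib stream (RFC 1950) starts with a 2-byte header that can be sanity-checked cheaply, while a raw deflate stream (RFC 1951) starts directly with the 3-bit block header, so there is nothing cheap to test and many more candidate positions have to be tried. A minimal header check as a sketch (not the detection code Precomp uses):
    Code:
    #include <cstdint>

    // RFC 1950: byte 0 = CMF (method/window size), byte 1 = FLG.
    // A plausible zlib header has CM == 8 (deflate), CINFO <= 7 (<= 32K window),
    // and (CMF*256 + FLG) divisible by 31.
    bool looks_like_zlib_header(uint8_t cmf, uint8_t flg) {
        if ((cmf & 0x0F) != 8) return false;      // compression method must be deflate
        if ((cmf >> 4) > 7)    return false;      // window size 2^(CINFO+8), max 32K
        return (cmf * 256u + flg) % 31u == 0;     // header check value
    }

    A raw deflate stream like the ones in the .raw files above has no such header, which is why -brute has to rely on trial decompression to reject false positives.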


  #4 Member (Yugoslavia)
    Are these options:
    Code:
      lm[amount]   Set maximal LZMA memory in MiB <2048>
      lt[count]    Set LZMA thread count <auto-detect: 4>
      lf[+-][xpiatsd] Set LZMA filters (up to 3 of them can be combined) <none>
                      lf+[xpiatsd] = enable these filters, lf- = disable all
                      X = x86, P = PowerPC, I = IA-64, A = ARM, T = ARM-Thumb
                      S = SPARC, D = delta (must be followed by distance 1..256)
      llc[bits]    Set LZMA literal context bits <3>
      llp[bits]    Set LZMA literal position bits <0>
                   The sum of lc and lp must be inside 0..4
      lpb[bits]    Set LZMA position bits, must be inside 0..4 <2>
    used only for compression?
    Thanks.

  #5 schnaader (Programmer)
    Yes, they have no effect when used during decompression (-r). As a rule of thumb, memory used for decompression is roughly a factor of 10 lower. Also note that the lzma library used doesn't support multithreaded decompression, but this usually doesn't lower -r performance since recompression of the streams is the bottleneck (and that will use multiple threads in later Precomp versions).


  #6 Member (Yugoslavia)
    Thanks.
    Does it support LZX/CAB/WIM?

  #7 schnaader (Programmer)
    No (not yet). There is a recompressor that works for many files, but it's not open source yet and is Windows-only at the moment.

  #8 Shelwien (Administrator)
    http://nishi.dreamhosters.com/u/precomp047.rar
    Code:
     95.828s:  precomp.exe -o1 testfile2.zip
     80.640s:  precomp047gcc82.exe -o1 testfile2.zip
     80.484s:  precomp047ic19.exe -o1 testfile2.zip
     78.157s:  precomp047cl90.exe -o1 testfile2.zip
    (testfile2.zip is a 400M zip archive of pdfs)


  #9 schnaader (Programmer)
    I would like to point out that these are 32-bit compiles. As these use less memory by default, "-lm2048" has to be used when comparing to 64-bit compiles to use the same LZMA block size for both. They might also run out of memory when using more than 2 (or 3) GB.

  #10 Shelwien (Administrator)
    All exes are x64, I just checked again.
    Also all outputs were exactly the same, I logged md5s of pcf files.
    Code:
     82.594s:  precomp.exe -cn -intense -o1 testfile2.zip
     80.390s:  precomp047gcc82.exe -cn -intense -o1 testfile2.zip
     80.141s:  precomp047ic19.exe -cn -intense -o1 testfile2.zip
     77.469s:  precomp047cl90.exe -cn -intense -o1 testfile2.zip


  #11 schnaader (Programmer)
    Quote Originally Posted by Shelwien:
    All exes are x64, I just checked again.
    Ah, so as I understand it now, for your speed measurements you used 64-bit compiles and the .rar you posted contains 32-bit compiles.

  #12 Shelwien (Administrator)
    No, the exes in the archive are the same x64 ones.

  #13 schnaader (Programmer)
    Ah, perhaps they just report as 32 bit because the "BIT64" define isn't set. I was confused because of the output they give:

    Code:
    Precomp v0.4.7 Windows 32-bit
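
    For reference, the bitness banner could also be derived from compiler-provided macros instead of a hand-set define; a sketch (PRECOMP_BITNESS is a made-up name here, and BIT64 is whatever the actual build scripts set):
    Code:
    // 64-bit detection via predefined macros, no manual -DBIT64 needed.
    #if defined(_WIN64) || defined(__x86_64__) || defined(__aarch64__)
      #define PRECOMP_BITNESS "64-bit"
    #else
      #define PRECOMP_BITNESS "32-bit"
    #endif
    // e.g. printf("Precomp v0.4.7 Windows %s\n", PRECOMP_BITNESS);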

  #14 Shelwien (Administrator)
    Ah, I didn't notice, normally __x86_64 is used for that.

    32-bit version is much harder to compile actually, I only succeeded with IntelC: http://nishi.dreamhosters.com/u/precomp047.rar
    With gcc... My gcc82/x64 somehow has working std::thread, but all others don't.
    I tried using the external workaround, but it wasn't enough.
    And clang built the exe, but it doesn't work (crashes while printing the copyright string).
    Code:
     95.781s:  precomp32.exe -cn -intense -o1 testfile2.zip
     87.766s:  precomp047ic19x32.exe -cn -intense -o1 testfile2.zip
     85.391s:  precomp047cl60x32.exe -cn -intense -o1 testfile2.zip
    Update:
    clang is fun: managed to compile a working binary with LLVMv6 in the end. Still faster, so everything's good?
    LLVMv7, same script: linker crashes
    LLVMv8-9, same script: looks like clang generates 64-bit pointers there, or some such. Any printf with a string var crashes.

  #15 Dimitri (Member)
    Congratulations on the new Precomp.

    Here is a random file I had for testing.

    Random Test and Results

    Input size = 1.13 GB zlib file
    Tools used = Xtool v9 x64 vs Precomp v0.4.7 x64

    Xtool, 1 thread = 7.84 MB/s
    Xtool, 4 threads = 17.90 MB/s
    Precomp, 1 thread = 2.24 MB/s

    Precomp pros = ability to detect JPG and MP3 streams
    Xtool pros = multithreading, can simultaneously recompress lz4, zstd, oodle, zlib, crilayla, lzo

    Testing environment
    Win 8.1
    i3 3220
    RAM 6 GB
    RAID0 - 2 HDD, SATA3

    For more info look at the attached screenshot.
    [Attachment: Capture.PNG, 750.1 KB]

  #16 Member (Yugoslavia)
    It would be nice to add decryption/encryption.
    For PS3 PKG files, there are utilities that can decrypt them ('TrueAncestor'). I don't know if they re-encrypt to the exact same file, though.

  #17 Shelwien (Administrator)
    Schnaader, what do you think about the feasibility of partial recompression for large-dictionary codecs (lzma, lzham, lzx, zstd)?

    The idea is to only remove the entropy coding (which is usually easy to do losslessly), converting a compressed stream to unpacked tokens.
    Of course, it won't improve compression directly, even if we compressed the tokens with the strongest general-purpose codec.
    But I think it can be an interesting option for dedup.

    Like, we can produce an alternate stream with lzma tokens for all input data (by compressing it),
    then also remove the entropy coding from detected lzma streams and look for matches,
    both within the main stream (for literal strings) and the alternate stream.
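
    A rough sketch of what such an entropy-stripped representation could look like (names and layout made up for illustration, not an actual Precomp or lzma structure): literals stay as raw bytes in one stream, match decisions become (distance, length) pairs in another, and dedup can then look for repeats in both without ever re-running a full parse.
    Code:
    #include <cstdint>
    #include <vector>

    // Hypothetical entropy-stripped view of an LZ stream: the range/entropy
    // coding is undone, but the original match decisions are kept unchanged,
    // so re-encoding with the same coder restores the original stream.
    struct LzToken {
        bool     is_match;   // false = literal
        uint8_t  literal;    // valid if !is_match
        uint32_t distance;   // valid if is_match
        uint32_t length;     // valid if is_match
    };

    struct StrippedStream {
        std::vector<LzToken> tokens;     // the "alternate" token stream
        std::vector<uint8_t> literals;   // literal bytes only, convenient for dedup
    };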


  #18 Member (Puerto Rico)
    Maybe Wimlib can help with adding recompression support for Windows .wim images and other files using LZX and LZMS? https://wimlib.net/

  #19 schnaader (Programmer)
    Quote Originally Posted by Shelwien:
    Schnaader, what do you think about feasibility of partial recompression for large-dictionary codecs? (lzma,lzham,lzx,zstd)
    Yes, I agree that partial recompression (entropy only) is a way to go. It has several advantages:
    - decompressed data is known, so it can be added to a "dictionary stream" and used later (e.g. if the same or similar data is compressed elsewhere)
    - same for tokens and "alternate" tokens, as you described
    - it's much easier to implement than full recompression (entropy + reconstruction of tokens) => better format support, especially in the ever-growing domain of LZ compressors
    - only restoring entropy coding is much faster
    - we get the additional information (decompressed data + tokens) even if we decide to leave the compressed stream unchanged, though we still have to invest the CPU cycles for entropy decoding it on the way back

  #20 Shelwien (Administrator)
    @moisesmcardona:

    Somewhat, but it's not so simple.

    1) Afaik, the wimlib compression implementation is different from cabarc etc., and recompression needs an exact match.

    2) reflate-like recompression for large-window codecs is impractical:
    - see lzmarec detection speed (it does 9*5*5 = 225 decoding attempts per input data byte);
    - encoding with a large mc is not so rare, so ideally we'd have to consider all matches within the window (note that lzma supports 2-byte matches and can in theory choose any of them);
    - we need a way to detect fb, mc;
    - "a" and "mf" are totally out of luck;
    - the official lzma has 5 or so different parser versions;

    3) precomp-like (well, older Precomp) recompression is sometimes possible
    (detect/extract streams with lzmadump, find a matching 7z version, find matching options, unpack/compress the streams, diff with the original), but:
    - analysis would also be very slow;
    - analysis has to be multi-pass, so it's not really streamable and unfit for archivers;
    - unpacking recompressed data would involve (at least!) basic lzma encoding, which can run at 1 MB/s or so;
    - recompression would only work sometimes (usually possible for games, since they'd use the same encoder/options for all data, but not for e.g. a *.7z collection);

    This is all kinda true for LZX as well, but for LZX we'd also need to reverse-engineer encoder versions.

    So I'm proposing a partial recompression method, which can at least take care of dups in different streams.
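
    On the detection cost in 2): 9*5*5 = 225 matches the size of the lzma property space (lc 0..8, lp 0..4, pb 0..4), so presumably one trial decode is kept alive per combination. A sketch of that enumeration (try_decode is a placeholder, not a real lzmarec API):
    Code:
    #include <cstddef>
    #include <functional>

    // Placeholder for a trial LZMA decode with fixed lc/lp/pb at a given offset.
    using TryDecode = std::function<bool(size_t pos, int lc, int lp, int pb)>;

    // Brute-force the property space: lc 0..8, lp 0..4, pb 0..4 = 9*5*5 = 225 combos.
    int count_plausible_settings(size_t pos, const TryDecode& try_decode) {
        int plausible = 0;
        for (int lc = 0; lc <= 8; ++lc)
            for (int lp = 0; lp <= 4; ++lp)
                for (int pb = 0; pb <= 4; ++pb)
                    if (try_decode(pos, lc, lp, pb))
                        ++plausible;
        return plausible;
    }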

  #21 Shelwien (Administrator)
    @schnaader:

    Working with "virtual" decompressed data can potentially turn it into full recompression if there's a good match with encoder -
    the benefit of this approach is that it can avoid redundancy, which recompression preprocessing currently adds.

    Problem is, we need a multi-stream LZ coder for this :)
    Btw, its also compatible with different kinds of preprocessing, like utf16<->utf8.
    Atm we can't do it unconditionally since it has a high chance of hurting compression,
    but with support for alternate-stream matches there should be a good effect.

    I wonder what other types of preprocessing would be compatible?..
    Makes sense to apply it to deflate too, I guess.

  #22 schnaader (Programmer)
    Two similar use cases (other than LZ*) that come to my mind:

    PDF: Words in there are often fragmented, so creating an "unfragmented" text stream would help, e.g. if a PDF is compressed together with its EPUB version (where things are less fragmented).

    From FlashMX.pdf:
    Code:
    "(Dy)6.3(nam)6.8(i)7.1(c G)27.4(r)6.3(aph)7.5(i)7.1(cs)" => "Dynamic Graphics"

    DXT: Decompressing the image data is easy, but neither transformations of the original data nor recompression of the decompressed data help. Also see Precomp issue #53 and this blog article.

  #23 Shelwien (Administrator)
    PDF probably needs explicit preprocessing... unfortunately it's too complex.

    But as for DXT, I think my idea with steganography applies to it.
    I mean, we should be able to make a reversible DXT decoder with a quantization map for it.
    Then we'd get a bmp with a steganographic payload, which can be compressed normally with FLIF or something.
    It's a universal approach for all lossy compression methods.
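
    To back up the "decompressing it is easy" part: a DXT1/BC1 block is just two RGB565 endpoint colors plus 16 two-bit indices, so the decode itself is a few lines (sketch below). The hard part is the reversible side, i.e. keeping a quantization/steganography map so the exact original block can be rebuilt; that is not shown here.
    Code:
    #include <cstdint>

    struct RGB { uint8_t r, g, b; };

    // Expand a packed 5:6:5 color to 8-bit channels.
    static RGB rgb565(uint16_t c) {
        return RGB{ uint8_t(((c >> 11) & 31) * 255 / 31),
                    uint8_t(((c >> 5)  & 63) * 255 / 63),
                    uint8_t(( c        & 31) * 255 / 31) };
    }

    static RGB mix(RGB a, RGB b, int wa, int wb, int div) {
        return RGB{ uint8_t((wa * a.r + wb * b.r) / div),
                    uint8_t((wa * a.g + wb * b.g) / div),
                    uint8_t((wa * a.b + wb * b.b) / div) };
    }

    // Decode one 8-byte BC1 block into a 4x4 tile (out[16], row by row).
    void decode_bc1_block(const uint8_t block[8], RGB out[16]) {
        uint16_t c0 = uint16_t(block[0] | (block[1] << 8));
        uint16_t c1 = uint16_t(block[2] | (block[3] << 8));
        RGB pal[4] = { rgb565(c0), rgb565(c1), {}, {} };
        if (c0 > c1) {                        // 4-color mode
            pal[2] = mix(pal[0], pal[1], 2, 1, 3);
            pal[3] = mix(pal[0], pal[1], 1, 2, 3);
        } else {                              // 3-color mode (4th entry = black here)
            pal[2] = mix(pal[0], pal[1], 1, 1, 2);
            pal[3] = RGB{0, 0, 0};
        }
        for (int px = 0; px < 16; ++px)       // 2-bit palette index per texel
            out[px] = pal[(block[4 + px / 4] >> ((px % 4) * 2)) & 3];
    }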

  #24 Member (South Africa)
    @schnaader,

    Today I tried to compress an old XLSX file and I got interesting output!

    I think the file had been created by Office 2010. Precomp 0.4.6 and lower could decompress the file almost without adding any noticeable corrections. The size was exactly the same as the decompressed XLSX file (by 7-Zip), but the output of v0.4.7 was bigger. As a result, the 0.4.6 file could be compressed better than the 0.4.7 file.

    Code:
    Original size: 20,502,859
    Decompressed: 98,272,373
    Precomp046: 98,276,029 , ZIP streams: 12/12, -zl13,14,17 -d0
    Precomp047: 98,786,845 , ZIP streams: 12/12, -d0

    I tried the -brute option too for v0.4.7, but also got -d0 and the same file size.

    The story is different with an XLSX saved by Office 2013, for which v0.4.7 could find all the streams but v0.4.6 failed (it only found 4 out of 12 streams).

    Is it possible to fix this issue for old XLSX files? I can send you the file if you need it.

    Thanks

  #25 schnaader (Programmer)
    Quote Originally Posted by msat59:
    Is it possible to fix this issue for old XLSX files? I can send you the file if you need it.
    Interesting. Yes, please send me the file, it's a good test file and it might be possible to improve the situation - though it's not really possible to "fix" it, as it is a result of changing to preflate in 0.4.7.

    Some explanation first: the old 0.4.6 version brute-forced zLib parameters and only had to store these. This could be very slow, and some files could not be processed by it at all. The new version uses preflate, which fixes the latter, but stores reconstruction data for each zLib stream. Both non-zLib streams and "unusual" zLib streams like the ones you found (the zLib levels 13, 14, 17 that Precomp 0.4.6 reported are rather unusual) have more reconstruction data than others. Usually the difference to 0.4.6 isn't that big and is compensated by all the streams that 0.4.6 couldn't recompress. For example, see FlashMX.pdf, where the resulting file size is worse, but the difference is below 1%:

    Code:
    	0.4.6		0.4.7
    -cn	26.932.960	26.936.513
    -cl	2.752.903       2.759.271
    Using "-v", you can see how big the reconstruction data for each stream is, see the example below. If one of the streams in your file has unusual big reconstruction data, you could ignore it's position using "-i", but my suspicion is that they all have some reconstruction data, so this won't work well (every time you ignore a stream, you also lose the gains from decompressing it).

    Code:
    (3.87%) Possible zLib-Stream in PDF found at position 174942
    Compressed size: 418463
    Can be decompressed to 1247832 bytes
    Non-ZLIB reconstruction data size: 977 bytes
    A real "fix" would be to use the old zLib code from 0.4.6 together with the new preflate code and only use preflate if the old code doesn't find a match. I decided against this in 0.4.7 because it would be slow and ratio would only be significantly better for some rare corner case files as the one you found (your file is the most extreme I saw so far).

  #26 Member (South Africa)
    I've messaged you the download link for the samples.

    I used -v with v0.4.7.

    Code:
    (0.03%) Possible zLib-Stream in ZIP found at position 5259
    Compressed size: 20434601
    Can be decompressed to 97959201 bytes
    Non-ZLIB reconstruction data size: 511029 bytes

    There is nothing similar in the v0.4.6 output:

    Code:
    (0.03%) Possible zLib-Stream in ZIP found at position 5259
    Compressed size: 20434601
    Can be decompressed to 97959201 bytes
    Identical recompressed bytes: 20434592 of 20434601
    Identical decompressed bytes: 97959201 of 97959201
    Best match with level combination 17, windowbits = 15: 20434592 bytes, decompressed to 97959201 bytes
    Recursion start - new recursion depth 1
    No recursion streams found
    Recursion end - back to recursion depth 0
    Penalty bytes were used: 5 bytes
