
Thread: Can you pack me?

  1. #1
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts

    Can you pack me?

    I would like to propose a challenge: let's see how much we can squeeze out of a certain corpus while sticking to some restraints...


    I have a set of files consisting of a collection of drivers I downloaded some years ago in an attempt to fix my old XP. When I tried to compress it, I realised it is a very good corpus to practice on. Or to show off your skills.
    What is so special about this set? Well, if the goal is a good, practical result, to back it up or to share it with someone, the compression process has to take several fields of data compression theory into consideration. And not all of them are fully implemented yet... So I think we can do some magic here. Here are some tips:


    1) Per-file deduplication


    31.27% of the corpus consists of files that are identical copies of other files elsewhere in the folder. Since checking for this is much faster than just compressing everything together, and it probably leads to lower resource usage and better ratios, it is very likely a must-have feature in the packer of choice.
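
    Just to illustrate the idea (a minimal sketch, assuming a hypothetical folder name and a plain SHA-256 content hash; it is not tied to any particular packer), per-file deduplication boils down to grouping files by their hash and storing each unique content only once:

    Code:
    import hashlib
    import os

    def dedupe_scan(root):
        """Group files under `root` by content hash; identical files only need to be stored once."""
        groups = {}  # hex digest -> list of relative paths
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                groups.setdefault(h.hexdigest(), []).append(os.path.relpath(path, root))
        return groups

    if __name__ == "__main__":
        groups = dedupe_scan("drivers_corpus")  # hypothetical folder name
        duplicates = sum(len(paths) - 1 for paths in groups.values() if len(paths) > 1)
        print(f"{duplicates} files are exact copies and can be stored as references")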


    2) Executable preprocessing


    Since this is a folder filled with drivers, you can immediately understand the importance of having a good x86 filter.
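
    Roughly speaking, an x86 filter rewrites the relative offsets of CALL/JMP instructions into absolute addresses, so repeated calls to the same routine produce repeated byte patterns the compressor can match. A toy sketch in the spirit of the classic E8/E9 transform follows (illustrative only, not the exact filter of any particular packer; a real one would also check operand ranges):

    Code:
    import struct

    def e8e9_forward(data: bytes) -> bytes:
        """Toy E8/E9 transform: turn 32-bit relative call/jmp offsets into absolute targets."""
        buf = bytearray(data)
        i = 0
        while i + 5 <= len(buf):
            if buf[i] in (0xE8, 0xE9):  # CALL rel32 / JMP rel32
                rel = struct.unpack_from("<i", buf, i + 1)[0]
                struct.pack_into("<I", buf, i + 1, (rel + i + 5) & 0xFFFFFFFF)
                i += 5
            else:
                i += 1
        return bytes(buf)

    def e8e9_inverse(data: bytes) -> bytes:
        """Undo the transform so the executable is restored bit for bit."""
        buf = bytearray(data)
        i = 0
        while i + 5 <= len(buf):
            if buf[i] in (0xE8, 0xE9):
                absolute = struct.unpack_from("<I", buf, i + 1)[0]
                struct.pack_into("<I", buf, i + 1, (absolute - i - 5) & 0xFFFFFFFF)
                i += 5
            else:
                i += 1
        return bytes(buf)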

    3) Bitmap lossless compression


    8.73% of the corpus comes in the form of uncompressed still images and animations, stored as resources inside the executables (that is in addition to the items extracted with the next technique).
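
    One reason dedicated bitmap handling pays off (purely as an illustration, not any specific packer's filter): subtracting each scanline from the previous one turns smooth gradients into long runs of near-zero bytes, which a generic compressor then squeezes much better, and the transform is trivially reversible so nothing is lost:

    Code:
    def delta_filter_rows(pixels: bytes, row_bytes: int) -> bytes:
        """Byte-wise vertical delta over scanlines; reversible, so the image stays lossless."""
        out = bytearray(pixels[:row_bytes])          # first scanline is kept as-is
        for i in range(row_bytes, len(pixels)):
            out.append((pixels[i] - pixels[i - row_bytes]) & 0xFF)
        return bytes(out)

    def delta_unfilter_rows(filtered: bytes, row_bytes: int) -> bytes:
        """Inverse transform, restoring the original pixel data bit for bit."""
        out = bytearray(filtered[:row_bytes])
        for i in range(row_bytes, len(filtered)):
            out.append((filtered[i] + out[i - row_bytes]) & 0xFF)
        return bytes(out)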

    4) Recompression


    38.29% of the data is LZX-compressed (Microsoft Lempel-Ziv eXtended). This is an old algorithm which is neither very good at compressing nor fast at decompressing, and all the files are packed one by one, separately (*.??_). In a quick test, I was able to do about 25 percentage points better: 7.65% (extracted raw + 7z -m9x) vs 32.58% (original LZX-compressed + 7z -m9x). In other words, I saved an extra 277 MiB.
    CHM compiled help files are complete web pages, JPEG and GIF images and all, and they are compressed with LZX too.
    LZX can also be found in Cabinet files, MSI installers and disk images.
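
    For the detection side, a rough sketch (assuming the external tool cabextract for the actual unpacking; whether the original LZX stream can later be re-created bit-exactly depends entirely on the recompressor you pick, and is not shown here):

    Code:
    import os
    import subprocess

    MSCF_MAGIC = b"MSCF"  # cabinet header; older SZDD-style *.??_ files need Windows' expand.exe instead

    def find_lzx_candidates(root):
        """List files that look like Microsoft cabinets, the usual carriers of LZX data."""
        hits = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    if f.read(4) == MSCF_MAGIC:
                        hits.append(path)
        return hits

    def extract_cab(cab_path, out_dir):
        """Expand one cabinet so the raw payload can be recompressed with a stronger codec."""
        subprocess.run(["cabextract", "-q", "-d", out_dir, cab_path], check=True)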

    5) Automatic data analysis


    PE executables are containers for other types of data besides machine code. One example is all those bitmaps, but there is more, like (some fonts and) about 29 MB of PCM audio. Part of the corpus is plain-text INI files. Of course, the packer can't give them proper treatment if it doesn't know where they are or what they are.
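
    A very naive scanner for a couple of those embedded types, just to show the idea (signatures and heuristics only; a real analyzer would walk the PE resource directory instead of grepping raw bytes):

    Code:
    def find_pcm_audio(data: bytes):
        """Yield offsets of RIFF/WAVE chunks (PCM audio) embedded in a file's raw bytes."""
        pos = data.find(b"RIFF")
        while pos != -1:
            if data[pos + 8:pos + 12] == b"WAVE":
                yield pos
            pos = data.find(b"RIFF", pos + 1)

    def looks_like_ini(data: bytes) -> bool:
        """Crude plain-text heuristic: mostly printable bytes, often starting with a [section]."""
        sample = data[:4096]
        if not sample:
            return False
        printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in sample)
        return sample.lstrip().startswith(b"[") or printable / len(sample) > 0.95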

    6) Smart sorting


    It is impressive how much both the ratio and the resource usage are affected by how files are sorted. Is sorting by name the best choice? By size? By extension, maybe? Let's find out.
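
    For example, one ordering worth trying (purely illustrative; finding the best key for this corpus is part of the challenge) groups files by extension first, so similar content ends up adjacent in the solid stream:

    Code:
    import os

    def sorted_for_solid_archive(root):
        """Return file paths sorted by (extension, name, size) so similar files sit together."""
        paths = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                paths.append(os.path.join(dirpath, name))
        return sorted(paths, key=lambda p: (os.path.splitext(p)[1].lower(),  # group by extension
                                            os.path.basename(p).lower(),     # then by name
                                            os.path.getsize(p)))             # then by size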



    The idea is to produce a single compressed file using a practical method. That method can be anything from a single custom-made executable doing the whole process to a batch script using external tools, or whatever you can imagine, really. The term "practical" implies the following conditions:


    A) Use only asymmetric algorithms. Exceptions could be made for fast symmetric algos like PPMd and some dedupe implementations, or for highly specific recompressors like packJPG.


    B) Limit the RAM usage on decompression. I think 1.5 GB should suffice.


    C) Avoid the use of temporary files as far as possible. It is not forbidden, but it is undesirable since it slows down the whole process.


    D) Have a straightforward way to unpack it. Preferably an SFX stub or a batch file with all the programs involved included (the unpacker size is not taken into account, so if you are using hard-coded dictionaries, please include them in the archive).


    E) Be completely lossless, bit by bit. This is mainly because of the bitmaps.


    Why am I doing this? Well, for the fun of it, of course, but also because, as another user said, data compression is all about trade-offs. We have seen how much the ratios can be improved by adding monstrous amounts of RAM into the equation. We also know that the more CPU cycles a process uses, the better it can do. Which is completely OK, of course. Now, being able to accomplish stunning results with limited resources is a completely different animal, and I think this is where the real art lives: optimisation. As we like to say: "Más vale maña que fuerza"... "Brains over brawn".


    The results will be ordered in a sheet using the formula savings*decompression_speed, so the stronger and faster entries will be on top. If you use a strong but slow method, your entry will automatically be ranked low; the same applies to weak but fast methods. I'll measure decompression on my system three times and the fastest run will be used. It might be a good idea for someone else to do the timings on another PC too.
    Right now, I'm measuring the first six, ordered by size, smallest first: Nanozip -cO, FreeArc -mx -s, 7z Radyx LZMA2, CSArc, UHArc and 7z LZMA2.
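
    To make the ranking concrete, here is one plausible reading of the score (the numbers below are made up; only the savings * decompression_speed formula comes from the post, with speed taken as original bytes restored per second):

    Code:
    def score(original_size, compressed_size, decompression_seconds):
        """Rank entries by savings * decompression_speed; bigger is better."""
        savings = original_size - compressed_size            # bytes saved
        speed = original_size / decompression_seconds        # bytes restored per second
        return savings * speed

    # Hypothetical example: a 2 GiB corpus compressed to 400 MiB, unpacked in 60 s
    print(score(2 * 1024**3, 400 * 1024**2, 60.0))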


    EDIT: Uploading again.


    A small word of warning: I'm not aware of any malware inside the package but it's full of executable files, and the source is anonymous... So please don't double-click any of them. Just to be sure. Thank you.


    And have fun!
    Last edited by Gonzalo; 4th December 2016 at 00:34. Reason: Trouble with the original link

  2. #2
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    149
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Cannot access the corpus.

  3. #3
    Member snowcat's Avatar
    Join Date
    Apr 2015
    Location
    Vietnam
    Posts
    27
    Thanks
    36
    Thanked 11 Times in 8 Posts
    Somehow Google deleted your corpus...

  4. #4
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Thanks. It's not deleted, but I don't know why it reports an infringement. Maybe it's because Google doesn't know how to analyze a Nanozip archive. I'm working on it.

  5. #5
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    That's it! While I'm still waiting for a response from the Google Drive support team, I uploaded the file again, this time to MEGA.nz:

    Nanozip, smaller size:
    https://mega.nz/#!4pxXWboR!fuUg7MrBqBQ8Lffg6ki_lOCD2oYxIc4La226EuZ4vhY

    7-Zip, faster decompression:
    https://mega.nz/#!spgXTBpT!6XCvo9slYJCUdPxjx0Ru2_SpoC1kw0oCRHyMupjf6P4
    Last edited by Gonzalo; 5th December 2016 at 12:30.

  6. #6
    Member
    Join Date
    Apr 2009
    Location
    here
    Posts
    202
    Thanks
    165
    Thanked 109 Times in 65 Posts
    any results here?

