I would like to propose a challenge: let's see how much we can squeeze out of a certain corpus while sticking to some constraints...
I have a set of files consisting of a collection of drivers I downloaded some years ago in an attempt to fix my old XP install. When I tried to compress it, I realised it makes a very good corpus to practice on. Or to show off your skills.
What is so special about this set? Well, if the goal is a good, practical result, to back it up or share it with someone, the compression process has to draw on several areas of data compression theory. And not all of them are fully implemented yet... so I think we can do some magic here. Here are some tips:
1) Per-file deduplication
31.27% of the corpus consists of files that are identical copies of other files elsewhere in the folder. Since checking for this is far faster than just compressing everything together, and it probably leads to lower resource usage and better ratios, it's very likely a must-have feature in the packer of choice.
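For reference, a minimal dedupe pass can be sketched like this (the two-stage size-then-hash check and the function names are my own choices, not taken from any particular packer):

```python
import hashlib
from pathlib import Path
from collections import defaultdict

def find_duplicates(root):
    """Group files by (size, SHA-256) so identical copies can be stored once."""
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)
    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate; skip the hashing
        for p in paths:
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            groups[(size, digest)].append(p)
    # keep only the groups that actually contain more than one copy
    return {k: v for k, v in groups.items() if len(v) > 1}
```

An archiver would then store one copy per group plus a list of the other paths, which is exactly the 31.27% saving before any entropy coding even runs.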
2) Executable preprocessing
Being a folder filled with drivers, you can immediately understand the importance of having a good x86 filter.
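To illustrate what such a filter does, here is a toy BCJ-style E8 transform (my own simplification: a real filter like 7-Zip's BCJ also handles 0xE9 jumps and the image base, and this sketch assumes a flat code buffer starting at offset 0). It rewrites the relative displacement after each 0xE8 CALL opcode into an absolute target, so repeated calls to the same function become identical byte strings that an LZ matcher can find:

```python
import struct

def e8_transform(data, encode=True):
    """Toy x86 call filter. encode=True converts relative CALL targets to
    absolute addresses; encode=False reverses it bit-exactly."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL rel32 opcode
            if encode:
                rel = struct.unpack_from("<i", out, i + 1)[0]
                absolute = (rel + i + 5) & 0xFFFFFFFF
                struct.pack_into("<I", out, i + 1, absolute)
            else:
                absolute = struct.unpack_from("<I", out, i + 1)[0]
                rel = (absolute - i - 5) & 0xFFFFFFFF
                struct.pack_into("<I", out, i + 1, rel)
            i += 5  # skip the displacement we just rewrote
        else:
            i += 1
    return bytes(out)
```

False positives (0xE8 bytes in data sections) are handled the same way in both directions, so the round trip stays lossless even when the heuristic misfires.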
3) Bitmap lossless compression
8.73% of the corpus is in the form of uncompressed still images and animations, stored as resources inside the executables; that is, besides the items extracted using the next technique:
4) LZX recompression
38.29% of the data is LZX-compressed (Microsoft Lempel-Ziv eXtended). This is an old algorithm which is neither very good at compressing nor fast at decompressing, and all the files are packed one by one, separately (*.??_). In a quick test I did about 25% better: 7.65% (extracted raw + 7z -m9x) vs 32.58% (original LZX-compressed + 7z -m9x). In other words, I saved an extra 277 MiB.
CHM compiled help files are complete web pages with JPEG and GIF images and all, compressed with LZX too.
LZX can be found also in Cabinet files, MSI installers and disk images.
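A sketch of the extraction step for the *.??_ files (the regex for the naming convention is my assumption; `expand.exe` is the stock Windows tool that understands this format, so the second function only works on Windows):

```python
import re
import subprocess
from pathlib import Path

# DOS-era convention: last character of the extension replaced by "_",
# e.g. driver.sy_ -> driver.sys (assumed pattern, two chars plus underscore)
CAB_SUFFIX = re.compile(r"\.[0-9a-z]{2}_$", re.IGNORECASE)

def find_lzx_packed(root):
    """Collect every file following the *.??_ MAKECAB naming convention."""
    return [p for p in Path(root).rglob("*")
            if p.is_file() and CAB_SUFFIX.search(p.name)]

def expand_all(root):
    """Decompress each file next to the original using Windows' expand.exe.
    Keep a manifest of original names if you need to re-pack later."""
    for p in find_lzx_packed(root):
        subprocess.run(["expand", "-r", str(p), str(p.parent)], check=True)
```

Once expanded, a solid LZMA pass sees the raw payloads instead of already-entropy-coded LZX streams, which is where the extra 277 MiB comes from.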
5) Automatic data analysis
PE executables are containers for other kinds of data besides machine code. One example is all those bitmaps, but there is more: some fonts, and about 29 MB of PCM audio. Part of the corpus is plain-text INI files. Of course, the packer can't give them proper treatment if it doesn't know where or what they are.
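The crudest form of such analysis is a signature scan (a real tool would walk the PE resource directory instead; this sketch just greps a blob for a few magic byte strings of my choosing):

```python
# Magic-byte signatures for the payload types mentioned above (assumed set)
MAGICS = {
    b"BM": "bmp",            # DIB bitmap file header
    b"RIFF": "riff/wave",    # container used for PCM audio
    b"GIF8": "gif",
    b"\xff\xd8\xff": "jpeg",
}

def scan_embedded(blob):
    """Return sorted (offset, kind) pairs for every known signature found."""
    hits = []
    for magic, kind in MAGICS.items():
        start = 0
        while (pos := blob.find(magic, start)) != -1:
            hits.append((pos, kind))
            start = pos + 1
    return sorted(hits)
```

With the offsets known, a packer can route each region to a specialised model (image filter, audio predictor, text model) instead of one generic one.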
6) Smart sorting
It's impressive how much both ratio and resource usage are affected by how the files are sorted. Is sorting by name the best choice? By size? Maybe by extension? Let's find out.
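One common heuristic to try first (my suggestion, not a claim about what wins on this corpus): group by extension so similar content sits adjacent in the solid stream, then by name so versioned copies of the same driver end up next to each other:

```python
from pathlib import Path

def ext_then_name(path):
    """Sort key: extension first, then file name, both case-folded."""
    p = Path(path)
    return (p.suffix.lower(), p.name.lower())

def ordered(paths):
    return sorted(paths, key=ext_then_name)
```

Swapping the key for `lambda p: Path(p).stat().st_size` or plain name order and re-running the same compressor is all it takes to benchmark the question above.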
The idea is to produce a single compressed file using a practical method. That method can be anything from a single custom-made executable doing the whole process to a batch using external tools or whatever you can imagine, really. The "practical" term implies the following conditions:
A) Use only asymmetric algorithms. The exceptions could be fast symmetric algos like PPMd and some dedupe implementations, or highly specific recompressors like packJPG.
B) Limit the RAM usage on decompression. I think 1.5 GB should suffice.
C) Avoid temporary files as far as possible. They are not forbidden, but they are undesirable as they slow down the whole thing.
D) Have a straightforward way to unpack it. Preferably an SFX stub or a batch with all the programs involved included (the unpacker size is not taken into account, so if you are using hard-coded dictionaries, please include them in the archive).
E) Be completely lossless, bit by bit. This is mainly because of the bitmaps.
Why am I doing this? Well, for the fun of it, of course, but also because, as another user said, data compression is all about trade-offs. We have seen how much ratios can improve by adding monstrous amounts of RAM into the equation. We also know that the more CPU cycles a process uses, the better it can do. Which is completely OK, of course. But being capable of accomplishing stunning results with limited resources is a completely different animal, and I think this is where the real art lives: optimisation. Like we use to say: "Más vale maña que fuerza"... "Brains over brawn".
The results will be ranked in a sheet using the formula savings*decompression_speed, so the stronger and faster entries will be on top. If you use a strong but slow method, your entry will automatically rank low; the same applies to weak but fast methods. I'll measure decompression on my system three times and take the fastest run. Maybe it's a good idea for someone else to do the timings on another PC too.
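My reading of that formula, written out (the exact units are an assumption: savings as a fraction of the original size, speed as MiB of unpacked output per second):

```python
def score(original_bytes, packed_bytes, unpack_seconds):
    """Rank = savings * decompression_speed, per the challenge rules.
    Units assumed: savings is a 0..1 fraction, speed is MiB/s of output."""
    savings = 1 - packed_bytes / original_bytes
    speed = original_bytes / (1024 ** 2) / unpack_seconds
    return savings * speed
```

For example, packing 200 MiB down to 50 MiB with a 10-second unpack gives 0.75 savings at 20 MiB/s, so a score of 15.0; halving the unpack time doubles the score, which is the intended bias toward fast decompressors.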
Right now, I'm measuring the first six, ordered by size, smaller first: Nanozip -cO, Freearc -mx -s, 7z Radyx LZMA2, CSArc, UHArc and 7z LZMA2.
EDIT: Uploading again.
A small word of warning: I'm not aware of any malware inside the package, but it's full of executable files and the source is anonymous... so please don't double-click any of them. Just to be sure. Thank you.
And have fun!