Deflate/Gzip compression is no longer a black box!
The son of defdb and pngthermal is finally here. defdb is a good tool for scrutinizing Deflate-compressed data at a very low level, but it doesn't give the "big picture"; gzthermal tries to make it more visual and accessible to average users (more features will be added over time to help distinguish literals from LZ matches and display LZ match boundaries).
gzthermal: displays compression efficiency heatmap of gzipped files.
Version 0.4b (23 May 2014) by Frederic Kayser
Usage: gzthermal [-e|-s|-m|-l] [-n|-w] [-b] [-z] [-g] file.gz
Output file is a PNG image called "gzthermal-result.png"
Options: e, s, m, l  extra small, small, medium or large text output
         n, w        narrow or wide text lines (default is in between)
         b           binary/hex mode (small text size equivalent only)
         z           bicolor (distinguish LZ matches from literals)
         g           grayscale mode (may help colorblind people)
A previously undocumented option is the compression level of the output PNG, which can be set between 0 and 9 simply by adding a digit; for instance, "gzthermal -ew9" should produce smaller output PNG files than "gzthermal -ew" (the default compression level is 4).
Regarding the rendered text line width, the default is 66 symbols, narrow is 48 and wide is 84.
The input file must already be compressed with gzip (or zopfli, kzip+kzip2gz, 7-zip/7za...; it could also have been retrieved from a web server with gzip compression enabled). The resulting image file is usually quite big and may prove difficult to handle on resource-limited computers.
Currently, bytes outside the printable ASCII range are replaced by a small square: ends of lines made of CR LF or just LF, TAB, and Unicode code points that require 2, 3 or 4 bytes in UTF-8 all appear this way.
It produces this type of view:
The color scale used is the same as the one introduced by pngthermal:
More precisely the cost of a symbol based on its background color is:
- midnight blue = strictly less than a bit
- dark blue = strictly less than 2 bits
- royal blue = strictly less than 3 bits
- teal = strictly less than 4 bits
- emerald green = strictly less than 5 bits
- chartreuse = strictly less than 6 bits
- yellow = strictly less than 7 bits
- orange = strictly less than 8 bits
- bright red = strictly less than 9 bits
- darker red tones are used the same way for 10, 11, 12... bits
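The list above boils down to taking the integer part of the cost in bits. Here is a minimal Python sketch of that lookup. The names PALETTE, heat_color and match_color are made up for this illustration (they are not taken from gzthermal), and it assumes the cost of an LZ match is spread evenly over the symbols it covers, as the 13/3 and 12/3 example further down suggests.

```python
import math

# Color names copied from the list above; entry i covers costs in [i, i+1) bits.
PALETTE = [
    "midnight blue", "dark blue", "royal blue", "teal", "emerald green",
    "chartreuse", "yellow", "orange", "bright red",
]

def heat_color(bits: float) -> str:
    """Return the palette name for a per-symbol cost given in bits."""
    if bits < 0:
        raise ValueError("a cost cannot be negative")
    bucket = math.floor(bits)
    # Anything at 9 bits or more falls into the darker red tones.
    return PALETTE[bucket] if bucket < len(PALETTE) else "darker red"

def match_color(match_bits: int, match_length: int) -> str:
    """Color shared by all symbols of an LZ match (assumption: cost split evenly)."""
    return heat_color(match_bits / match_length)

print(heat_color(0.4))       # a symbol costing less than one bit
print(match_color(13, 3))    # 3-symbol match costing 13 bits
print(match_color(12, 3))    # same match costing 12 bits
```

A 3-symbol match lands in the same bucket whether it costs 12 or 13 bits, which is why a one-bit saving can stay invisible in the image.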
You may have noticed that I used an HTML document (the http://encode.ru/ landing page) for my first sample. That is because HTTP/1.1 web servers usually compress text documents (html, css, js, svg...) on the fly using a Gzip/Deflate based compression tool; for instance, Apache uses mod_deflate (which in turn calls the well-known zlib) to send gzip data.
My first incentive was to easily demonstrate that the HTML5 document type declaration should be written in lower case rather than upper case:
<!DOCTYPE html> bad
<!doctype html> good (saves 2 bytes once compressed)
For compression gurus familiar with Shannon entropy it's pretty obvious; for the rest of the world, red areas now mark the places that compress badly.
Rule #1 upper case = bloody rare = bad compression
The same goes for charset="UTF-8" -> "utf-8", since according to the IANA Charset Reference "no distinction is made between use of upper and lower case letters".
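To put a number on Rule #1, one can estimate the ideal cost of each byte from its frequency alone, that is -log2(p) bits for a symbol of probability p. The sketch below is only an order-0 estimate on a made-up HTML snippet: it ignores LZ matches entirely and is not how gzthermal computes its colors. The helper name symbol_costs is hypothetical.

```python
import math
from collections import Counter

def symbol_costs(data: bytes) -> dict:
    """Ideal cost in bits of each byte value, derived from its frequency in data."""
    counts = Counter(data)
    total = len(data)
    return {sym: -math.log2(n / total) for sym, n in counts.items()}

# Made-up sample, just large enough to contain both letter cases.
sample = (b'<!DOCTYPE html><html><head><meta charset="UTF-8">'
          b'<title>t</title></head></html>')

costs = symbol_costs(sample)
for ch in "DdTt":
    print(ch, round(costs[ord(ch)], 2), "bits")
```

In this snippet the upper case letters occur less often than their lower case counterparts, so they come out more expensive.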
Gzip/Deflate compression is basically LZSS+Huffman. The Variable Length Encoding (VLE) provided by the Huffman algorithm is applied to three different types of elements: literals (standalone symbols that are not part of an LZ match), and the LZ length and LZ distance that together make up an LZ match pair.
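To make the literal vs. LZ match distinction concrete, here is a toy greedy LZ parser in Python. It is a brute-force sketch, not what zlib does (no hash chains, no lazy matching), and the function name lz_tokens is invented for the example. The limits used (matches of 3 to 258 symbols, distances up to 32768) are the Deflate ones. Pairs are written (length, distance), the same way defdb shows them.

```python
def lz_tokens(data: bytes, min_len: int = 3, max_len: int = 258,
              window: int = 32768) -> list:
    """Greedy LZ parse: literals (1-byte bytes objects) and (length, distance) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Extend the candidate match; it is allowed to overlap position i.
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k >= min_len and k >= best_len:  # on ties, keep the nearest match
                best_len, best_dist = k, i - j
        if best_len:
            tokens.append((best_len, best_dist))
            i += best_len
        else:
            tokens.append(data[i:i + 1])
            i += 1
    return tokens

print(lz_tokens(b"compression,data compression,")[-1])  # (12, 17)
print(lz_tokens(b"data compression,compression,")[-1])  # (12, 12)
```

Everything before the final pair is emitted as literals, because no sequence of 3 symbols or more repeats earlier in these short strings.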
The number of bits used to represent a symbol is derived from its frequency of occurrence: basically, symbols that appear often end up with shorter codes (say 3, 4, 5 bits; it also depends on the size of the source alphabet) and less frequent ones end up with longer codes (say 8, 9, 10, ... bits). This type of VLE actually predates Huffman coding: Morse code works this way, and the Linotype keyboard layout was driven by character frequencies.
The effect of VLE on the LZ match components is a bit harder to grasp. For instance, if there are far more LZ matches of length 5 (5 symbols replicated from a previous location) than LZ matches of length 3, it could lead to the counter-intuitive situation where longer words actually cost less in the compressed stream than shorter ones.
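The frequency to code length relationship can be sketched with a textbook Huffman construction. This is only an illustration: Deflate additionally caps code lengths at 15 bits and uses canonical codes, neither of which is handled here, and the symbol counts below are invented.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs: dict) -> dict:
    """Return the Huffman code length in bits for each symbol."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tie = count()  # tie-breaker so the heap never has to compare symbol lists
    heap = [(f, next(tie), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (f1 + f2, next(tie), syms1 + syms2))
    return lengths

# Hypothetical occurrence counts for a six-symbol alphabet.
print(huffman_code_lengths({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
# {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}
```

The most frequent symbol gets a 1-bit code while the two rarest ones need 4 bits each.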
You'll notice that compression works better toward the end of files (more blue, darker blue), or at least after some amount of data has been processed (a few thousand bytes/symbols). The main reason behind this warm-up period is that early in the data stream the dictionary (or more precisely the sliding window here) is nearly empty and a lot of symbols/words appear for the first time. Afterwards, when a symbol group, a word or a group of words appears again, it is replaced by a reference to a previous occurrence (that's an LZ match), and this saves a lot of space, hence the blue areas. In comparison, first occurrences appear bulky, as if there were a tax on novelty.
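The warm-up effect is easy to reproduce with Python's zlib module by feeding the same chunk twice to one compressor and flushing after each pass. This is a deliberately extreme case (the second pass is an exact repeat), and the sample sentence is made up; real files show the same trend more gradually.

```python
import zlib

text = (b"Early in the stream the sliding window is nearly empty, so most symbols "
        b"are emitted as literals; once the same words show up again they can be "
        b"replaced by short back references to a previous occurrence.")

comp = zlib.compressobj(9)
sizes = []
for chunk in (text, text):  # the very same chunk, fed twice
    # Z_SYNC_FLUSH forces out everything produced so far without ending the stream.
    out = comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
    sizes.append(len(out))

print("first pass :", sizes[0], "bytes")
print("second pass:", sizes[1], "bytes")
```

The second pass only has to encode back references into the window filled by the first one, so it comes out much smaller.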
Here I moved some keywords around to see if it could help compression.
"compression,data compression" have been swapped -> "data compression,compression"
This brings the second occurrence of "compression" closer to the first one. Visually there's no difference, and indeed the references to "compression" cost the same 14 bits whether they are 17 or 12 bytes away ((12,17) and (12,12) in defdb). The reference to "ta " (the end of "data " duplicated from "meta "), however, costs one bit less ((3,42) vs (3,30)). Unfortunately this does not show up in the image, since both 13/3 and 12/3 give a value greater than or equal to 4 and strictly lower than 5, which leads to the same green color. Nevertheless, a good move.
"zip" has moved closer to "7zip": "zip,rar,ace,7zip" -> "rar,ace,zip,7zip". This time the "zip," found at the end of 7zip is a bit darker and it actually costs 1 bit less ((4,13) vs (4,5)). Good move.
"paq" has moved to the end of the paqx series: "paq,paq6,paq7,paq8,paq9" -> "paq6,paq7,paq8,paq9,paq". This led to more regular back references ((4,4)... (4,5)... (4,5)... (4,5) vs (4,5)... (4,5)... (4,5)... (4,5)) but no saving at all.
So far it looks like the compressed stream sizes should differ by 2 bits... and the tricky thing is that it's not the case: the second one is 3 bits shorter (and, due to byte rounding, an entire byte shorter). Huh?! How is that possible?
Notice that the first block header is a bit shorter in sample-b.gz (h.size is the block header size): 597 vs 598 bits. Something has changed in there! Only a few distances have changed, so let's take a look at the number of occurrences of each LZ match distance code (the a and b columns stand for sample-a.gz and sample-b.gz respectively):
Picolo:gzsample Fred$ defdb sample-a.gz
T Boundary Tokens h.size b.size
2        0   4096    598  46880
2     6e0c    801    529  10059
56939 bits long (2 blocks)
Picolo:gzsample Fred$ defdb sample-b.gz
T Boundary Tokens h.size b.size
2        0   4096    597  46877
2     6e0c    801    529  10059
56936 bits long (2 blocks)
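The byte rounding can be checked with a couple of lines of Python, using the two totals reported by defdb above. The sketch assumes these totals cover the Deflate stream only; the gzip header and trailer are whole bytes, so they do not change the difference.

```python
def bits_to_bytes(bits: int) -> int:
    """Number of whole bytes needed to hold the given number of bits."""
    return (bits + 7) // 8

size_a = bits_to_bytes(56939)  # sample-a.gz
size_b = bits_to_bytes(56936)  # sample-b.gz
print(size_a, size_b, size_a - size_b)  # 7118 7117 1
```

56936 is an exact multiple of 8, while 56939 spills 3 bits into one more byte, which is how a 3-bit saving turns into a full byte.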
More header analysis to come... did not really expect this one to be that complicated.
Code  Range     a   b
d_00            3   3
d_01            3   3
d_03            7   6  ",paq" (4,4) gone
d_04  [5-6]    15  17  ",paq" (4,5) added, "zip," (4,5) added
d_05  [7-8]     6   6
d_06  [9-12]   26  27  "compression," (12,12) added
d_07  [13-16]  29  28  "zip," (4,13) gone
d_08  [17-24]  54  53  "compression," (12,17) gone
d_09  [25-32]  49  50  "ta " (3,30) added
d_10  [33-48]  72  71  "ta " (3,42) gone
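The Code and Range columns follow the distance code table of the Deflate specification (RFC 1951, section 3.2.5): each code covers a range of distances, and the position inside the range is sent as raw extra bits after the Huffman code. Here is a small Python sketch of that mapping; the function name distance_code is made up, the base values are the ones from the RFC.

```python
from bisect import bisect_right

# Smallest distance covered by each of the 30 Deflate distance codes.
DIST_BASE = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
             257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
             8193, 12289, 16385, 24577]

def distance_code(distance: int) -> tuple:
    """Return (code, extra_bits) for an LZ match distance in 1..32768."""
    if not 1 <= distance <= 32768:
        raise ValueError("Deflate distances range from 1 to 32768")
    code = bisect_right(DIST_BASE, distance) - 1
    # Codes 0 to 3 carry no extra bits, then one more bit every two codes.
    extra_bits = 0 if code < 2 else code // 2 - 1
    return code, extra_bits

# Distances that changed between sample-a.gz and sample-b.gz.
for d in (4, 5, 12, 13, 17, 30, 42):
    code, extra = distance_code(d)
    print(f"distance {d:2d} -> d_{code:02d} with {extra} extra bit(s)")
```

Moving "ta " from distance 42 to distance 30 switches from d_10 to d_09, a code with one extra bit less. The Huffman code lengths of the distance codes themselves depend on the counts in the a and b columns, which this sketch does not model.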
Rule #2 move similar words closer together to reduce the distance cost in the LZSS pair.
UglifyJS2 apparently does some basic stuff like counting character frequencies but still struggles with gzip compression; I think gzthermal and defdb could pave the way to overcoming this type of problem.