
Thread: gzthermal: pseudo thermal view of Gzip/Deflate compression efficiency

#1 · caveman

    Deflate/Gzip compression is no longer a black box!
    The offspring of defdb and pngthermal is finally here. defdb is a good tool to scrutinize Deflate compressed data at a very low level, but it doesn't give the "big picture"; gzthermal tries to make things more visual and accessible to average users (more features will be added over time to help distinguish literals from LZ matches and to display LZ match boundaries).
    Code:
    gzthermal: displays compression efficiency heatmap of gzipped files.
    Version 0.4b (23 May 2014) by Frederic Kayser
    Usage: gzthermal [-e|-s|-m|-l] [-n|-w] [-b] [-z] [-g] file.gz
    Output file is a PNG image called "gzthermal-result.png"
    Options: e, s, m, l  extra small, small, medium or large text output
             n, w        narrow or wide text lines (default is in between)
             b           binary/hex mode (small text size equivalent only)
             z           bicolor (distinguish LZ matches from literals)
             g           grayscale mode (may help colorblind people)
    Regarding rendered text line width the default is 66 symbols, narrow 48 and wide 84.
    The input file has to be already compressed with gzip (or zopfli, kzip+kzip2gz, 7-zip/7za... it could also have been retrieved from a web server with gzip compression enabled). The resulting image file is usually quite big and may prove difficult to handle on resource-limited computers.
    Currently, non-printable ASCII characters are replaced by a small square: line endings made of CR LF or just LF, TAB, and the bytes of Unicode code points that need 2, 3 or 4 bytes in UTF-8 all appear this way.

    It produces this type of view:

    The color scale used is the same as the one introduced by pngthermal:

    More precisely, the cost of a symbol, based on its background color, is as follows (a small lookup sketch follows the list):
    - midnight blue = strictly less than a bit
    - dark blue = strictly less than 2 bits
    - royal blue = strictly less than 3 bits
    - teal = strictly less than 4 bits
    - emerald green = strictly less than 5 bits
    - chartreuse = strictly less than 6 bits
    - yellow = strictly less than 7 bits
    - orange = strictly less than 8 bits
    - bright red = strictly less than 9 bits
    - darker red tones are used the same way for 10, 11, 12... bits
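    For reference, the color simply encodes the integer part of the per-symbol bit cost. Here is a minimal Python sketch of that bucketing (the color names just mirror the list above; the exact palette and the cap at the darkest shade are assumptions, not gzthermal's internals):
    Code:
    import math

    # Bucket k means "strictly less than k+1 bits" (see the list above).
    HEAT_SCALE = ["midnight blue", "dark blue", "royal blue", "teal",
                  "emerald green", "chartreuse", "yellow", "orange",
                  "bright red", "dark red", "darker red", "darkest red"]  # tail shades assumed

    def heat_color(bits):
        """Map the cost in bits of one decoded symbol to a heat-scale name."""
        bucket = math.floor(bits)              # e.g. 2.4 bits -> bucket 2 -> "royal blue"
        return HEAT_SCALE[min(bucket, len(HEAT_SCALE) - 1)]

    print(heat_color(0.7))   # midnight blue (strictly less than 1 bit)
    print(heat_color(8.2))   # bright red    (strictly less than 9 bits)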

    You may have noticed that I used an HTML document (the http://encode.ru/ landing page) in my first sample, since HTTP/1.1 web servers usually compress text documents (html, css, js, svg...) on the fly using a Gzip/Deflate-based compression tool; for instance, Apache uses mod_deflate (which in turn calls the well-known zlib) to send Gzip data.

    My first motivation was to easily demonstrate that the HTML5 document type declaration should be written in lower case rather than upper case:
    <!DOCTYPE html> bad
    <!doctype html> good (saves 2 bytes once compressed)
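    If you want to check this on your own pages, here is a minimal sketch using Python's zlib (the 2-byte figure obviously depends on the rest of the document, and the file name is a placeholder):
    Code:
    import sys, zlib

    # Compare the deflated size of a document with an upper-case vs lower-case doctype.
    # Assumes the doctype is written exactly one of these two ways.
    html = open(sys.argv[1], "rb").read()            # e.g. python doctype_cost.py page.html
    upper = html.replace(b"<!doctype html>", b"<!DOCTYPE html>", 1)
    lower = html.replace(b"<!DOCTYPE html>", b"<!doctype html>", 1)
    print("upper-case doctype:", len(zlib.compress(upper, 9)), "bytes (zlib)")
    print("lower-case doctype:", len(zlib.compress(lower, 9)), "bytes (zlib)")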

    For compression gurus familiar with Shannon entropy it's pretty obvious; for the rest of the world, red areas now mark badly compressed locations.
    Rule #1 upper case = bloody rare = bad compression

    The same goes for charset="UTF-8" -> charset="utf-8", since according to the IANA Charset Reference "no distinction is made between use of upper and lower case letters".

    Gzip/Deflate compression is basically LZSS + Huffman. The variable-length encoding (VLE) provided by the Huffman algorithm is applied to three different types of elements: literals (standalone symbols that are not part of an LZ match), the LZ length, and the LZ distance of an LZ match pair.
    The number of bits used to represent a symbol is derived from its frequency of occurrence: symbols that appear often end up with shorter codes (say 3, 4 or 5 bits, depending also on the size of the source alphabet) and less frequent ones end up with longer codes (say 8, 9, 10... bits). This type of VLE actually predates Huffman coding: Morse code works this way, and the Linotype keyboard layout was driven by character frequencies.
    The effect of VLE on the LZ match components is a bit harder to grasp; for instance, if there are far more LZ matches of length 5 (5 symbols replicated from a previous location) than LZ matches of length 3, it can lead to the counter-intuitive situation where longer matches actually cost less in the compressed stream than shorter ones.
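    For the literal side of this, Huffman code lengths roughly track the ideal cost of -log2(probability) bits per symbol, rounded to whole bits, which is also roughly what the heatmap shows for literal bytes. A tiny sketch of that ideal cost computed from byte frequencies (sample.html is a placeholder filename):
    Code:
    import math
    from collections import Counter

    data = open("sample.html", "rb").read()          # placeholder filename
    freq = Counter(data)

    # Ideal (Shannon) cost of each byte value: -log2 of its probability.
    # A Huffman coder assigns code lengths close to these, rounded to whole bits.
    costs = {sym: -math.log2(count / len(data)) for sym, count in freq.items()}

    for sym, bits in sorted(costs.items(), key=lambda kv: kv[1])[:5]:
        print(f"{chr(sym)!r}: ~{bits:.2f} bits ({freq[sym]} occurrences)")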

    You'll notice that compression works better toward the end of files (more blue, darker blue), or at least after some amount of data has been processed (a few thousand bytes/symbols). The main reason behind this warm-up period is that early in the data stream the dictionary (or, more precisely, the sliding window here) is nearly empty and a lot of symbols/words appear for the first time; later on, when a group of symbols or words appears again, it is replaced by a reference to a previous occurrence (that's an LZ match), which saves a lot of space, hence the blue areas. In comparison, first occurrences look bulky, as if there were a tax on novelty.
    This is also a good argument in favour of combining CSS and JavaScript files (a small sketch of the effect follows).
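    Here is a minimal sketch of that effect, with a.css and b.css as placeholder file names: deflating the two files together lets the second one reuse matches from the first, so the combined stream usually ends up smaller than the sum of the separate ones.
    Code:
    import zlib

    a = open("a.css", "rb").read()    # placeholder filenames
    b = open("b.css", "rb").read()

    def deflated_size(data):
        return len(zlib.compress(data, 9))

    print("separate:", deflated_size(a) + deflated_size(b), "bytes")
    print("combined:", deflated_size(a + b), "bytes")   # usually smaller: b can match into a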

    Here I moved some keywords around to see if it could help compression.

    "compression,data compression" have been swapped -> "data compression,compression"
    This brings the second occurrence of "compression" closer to the first one. Visually there's no difference, and indeed the references to "compression" have the same cost, 14 bits, whether they point 17 or 12 bytes back ([14] (12,17) and [14] (12,12) in defdb). However, the reference to "ta " (the end of "data ", duplicated from "meta ") costs one bit less ([13] (3,42) vs [12] (3,30)). Unfortunately this does not show up in the picture, since both 13/3 and 12/3 are greater than or equal to 4 and strictly lower than 5, leading to the same green color. Nevertheless, a good move.

    "zip" has moved closer to "7zip", "zip,rar,ace,7zip" -> "rar,ace,zip,7zip", this time the "zip," found at the end of 7zip is a bit darker and it costs effectively 1 bit less ([12] (4,13) vs [11] (4,5)), good move.

    "paq" has moved to the end of the paqx series, "paq,paq6,paq7,paq8,paq9" -> "paq6,paq7,paq8,paq9,paq", this led to more regular back references ([11] (4,4)...[11] (4,5)...[11] (4,5)...[11] (4,5) vs [11] (4,5)...[11] (4,5)...[11] (4,5)...[11] (4,5)) but no saving at all.

    So far it looks like the compressed stream sizes should differ by 2 bits... and the tricky thing is that this is not the case: the second one is 3 bits shorter (and, due to byte rounding, an entire byte smaller). Huh?! How is that possible?
    Code:
    Picolo:gzsample Fred$ defdb sample-a.gz
    T Boundary   Tokens   h.size   b.size
    2        0     4096      598    46880
    2     6e0c      801      529    10059
    56939 bits long (2 blocks)
    
    Picolo:gzsample Fred$ defdb sample-b.gz
    T Boundary   Tokens   h.size   b.size
    2        0     4096      597    46877
    2     6e0c      801      529    10059
    56936 bits long (2 blocks)
    Notice that the first block header is one bit shorter in sample-b.gz (h.size is the block header size): 597 vs 598 bits, so something has changed in there! Only a few distances have changed; let's take a look at the number of occurrences of each LZ match distance code (the a and b columns correspond to sample-a.gz and sample-b.gz respectively):

    Code:
    Code   Range   a    b
    d_00     [1]   3    3
    d_01     [2]   3    3
    d_02     [3]
    d_03     [4]   7    6  ",paq" (4,4) gone
    d_04   [5-6]  15   17  ",paq" (4,5) added, "zip," (4,5) added
    d_05   [7-8]   6    6
    d_06  [9-12]  26   27  "compression," (12,12) added
    d_07 [13-16]  29   28  "zip," (4,13) gone
    d_08 [17-24]  54   53  "compression," (12,17) gone
    d_09 [25-32]  49   50  "ta " (3,30) added
    d_10 [33-48]  72   71  "ta " (3,42) gone
    More header analysis to come... I did not really expect this one to be that complicated.
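    For reference, the d_nn codes above are the standard Deflate distance codes from RFC 1951: each match distance maps to one of 30 codes plus a few extra bits, and the Huffman code lengths assigned to those codes (stored in the block header) depend on how often each code is used, which is exactly what shifted between the two samples. A small sketch of the distance-to-code mapping:
    Code:
    import bisect

    # RFC 1951 distance codes: base distance and number of extra bits for codes 0..29.
    DIST_BASE = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
                 257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
                 8193, 12289, 16385, 24577]
    DIST_EXTRA = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
                  7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13]

    def distance_code(dist):
        """Return (code, extra_bits) for an LZ match distance in 1..32768."""
        code = bisect.bisect_right(DIST_BASE, dist) - 1
        return code, DIST_EXTRA[code]

    for d in (4, 5, 13, 17, 30, 42):
        code, extra = distance_code(d)
        print(f"distance {d:2d} -> d_{code:02d}, {extra} extra bits")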

    Rule #2 move similar words closer together to reduce the distance cost in the LZSS pair.

    A few years ago Nicholas Zakas wrote an interesting paper about JavaScript minification; he talks about gzip compression at the end of it. To my knowledge, these kinds of tools have never tried to get feedback from the compression step to improve their minification heuristics, such as variable name replacement. This is probably one of the most interesting applications gzthermal could be used for (apart from visually comparing LZ parsers to help improve them, or rewriting a deflopt/defluff-like tool).
    UglifyJS2 apparently does some basic stuff like counting character frequencies but still struggles with gzip compression; I think gzthermal and defdb could pave the way to overcoming this type of problem.
    Last edited by caveman; 23rd May 2014 at 21:34. Reason: New version 0.4b


#2 · caveman
    Just to prove a point, here is a gzipped JPEG image (the .jpg.gz file is a tad smaller than the .jpg file, 16494 vs 16567 bytes; to be clear: don't do this, it's not worth the CPU cost!) and the head of its compressed representation. It shows that the sole part of the JPEG file that can actually be compressed is the header, mainly the content of the DQT (Define Quantization Table) chunk; the rest of the file stays in the same red tones.



    JPEGsnoop reports the offset of the different JPEG markers:
    Code:
    *** Marker: SOI (Start of Image) (xFFD8) ***
      OFFSET: 0x00000000
    
    *** Marker: APP0 (xFFE0) ***
      OFFSET: 0x00000002
    
    *** Marker: DQT (Define Quantization Table) (xFFDB) *** 
      OFFSET: 0x00000014
    
    *** Marker: SOF0 (Baseline DCT) (xFFC0) ***
      OFFSET: 0x0000009A
    
    *** Marker: DHT (Define Huffman Table) (xFFC4) ***
      OFFSET: 0x000000AD
    
    *** Marker: SOS (Start of Scan) (xFFDA) ***
      OFFSET: 0x00000163
     
    *** Decoding SCAN Data ***
      OFFSET: 0x00000171
    The Quantization Tables (QT) hold the "quality" setting of the JPEG image (yep, it's a bit more complicated than a single value picked with a slider in Photoshop, but this is really how it works!). Each QT holds 64 values (usually 8-bit values; they turn into 16-bit values when a lot of JPEG "quality" is sacrificed).


    Usually two QT are used:
    - one for the Y component (Luma)
    - one for Cb and Cr components (Chroma)
    (Cb and Cr could each have their own separate QT, but most of the time the same one is used)

    Here is what JPEGsnoop reports about those in the sample.jpg file:
    Code:
    *** Marker: DQT (Define Quantization Table) (xFFDB) *** 
      OFFSET: 0x00000014
      Table length = 132
      ----
      Precision=8 bits
      Destination ID=0 (Luminance)
        DQT, Row #0:   1   1   1   1   1   1   1   1
        DQT, Row #1:   1   1   1   1   1   1   1   1
        DQT, Row #2:   1   1   1   1   1   1   1   2
        DQT, Row #3:   1   1   1   1   1   1   2   2
        DQT, Row #4:   1   1   1   1   1   2   2   3
        DQT, Row #5:   1   1   1   1   2   2   3   3
        DQT, Row #6:   1   1   1   2   2   3   3   3
        DQT, Row #7:   1   1   2   2   3   3   3   3
        Approx quality factor = 98.25 (scaling=3.50 variance=4.81)
      ----
      Precision=8 bits
      Destination ID=1 (Chrominance)
        DQT, Row #0:   1   1   1   2   2   3   3   3
        DQT, Row #1:   1   1   1   2   3   3   3   3
        DQT, Row #2:   1   1   1   3   3   3   3   3
        DQT, Row #3:   2   2   3   3   3   3   3   3
        DQT, Row #4:   2   3   3   3   3   3   3   3
        DQT, Row #5:   3   3   3   3   3   3   3   3
        DQT, Row #6:   3   3   3   3   3   3   3   3
        DQT, Row #7:   3   3   3   3   3   3   3   3
        Approx quality factor = 98.39 (scaling=3.23 variance=0.50)
    Notice that there's a lot of redundancy in these tables, which should naturally lead to good compression, all the more so since the tables are actually stored following a zigzag path.


    And here is how the first table is compressed by Deflate/Gzip (Defdb output):
    Code:
     [8] 01
    [18] (42,1)
     [8] 02
    [16] (10,1)
     [8] 03
    [15] (9,1)
    Basically it says that the data stream consists of 43 ones, followed by 11 twos and finally 10 threes (a literal 01 plus a length-42 match at distance 1, and so on).
    What took 64 bytes uncompressed only takes 9 bytes and a bit (73 bits) in a Deflate/Gzip-compressed JPEG. Unfortunately that's the sole part of the entire file that can be easily compressed, and this sample is quite extreme: QTs usually hold more than just three different values, and the zigzag path does not always magically remove all the complexity of the table.
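    To see where those three runs come from, here is a small sketch (generic code, not gzthermal/defdb) that rebuilds the luminance table above, walks it in JPEG zigzag order and deflates the result; the standalone zlib size won't match the 73-bit figure exactly, since in the real file the Huffman tables are shared with the rest of the stream:
    Code:
    import zlib

    # Luminance quantization table from the JPEGsnoop dump above (row by row).
    qt = [
        [1,1,1,1,1,1,1,1],
        [1,1,1,1,1,1,1,1],
        [1,1,1,1,1,1,1,2],
        [1,1,1,1,1,1,2,2],
        [1,1,1,1,1,2,2,3],
        [1,1,1,1,2,2,3,3],
        [1,1,1,2,2,3,3,3],
        [1,1,2,2,3,3,3,3],
    ]

    # JPEG zigzag order: walk the anti-diagonals, alternating direction.
    order = sorted(((i, j) for i in range(8) for j in range(8)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    stream = bytes(qt[i][j] for i, j in order)

    print(stream.hex(" "))   # 43 x 01, then 11 x 02, then 10 x 03
    print(len(zlib.compress(stream, 9)), "bytes deflated (plus the zlib wrapper)")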

    Another spot that shows some signs of compression is the content of the DHT (Define Huffman Table) chunk that follows the FFC4 marker.

    Here is a slightly different heatmap view, where the JPEG markers appear and the entire DQT and DHT chunks are underlined. Once the Start of Scan marker (FFDA) is crossed, Deflate/Gzip only faces already compressed data, which is extremely hard to compress again, hence the red area that unfolds until the end of the file.


    Some advanced packers like PackJPG actually uncompress this layer, compress it with a different engine, and re-encode it exactly as it was before during their decompression phase. Data Compression Explained by Matt Mahoney has a chapter dedicated to JPEG recompression.


    And before you ask: yes I plan to add a binary/hex mode to gzthermal.
    Last edited by caveman; 29th April 2014 at 23:21. Reason: Link to DCE


#3 · caveman
    New version 0.3a with binary/hexadecimal mode (-b), and even smaller font size (-e).
    Linux ARMv7 for test, Windows version in a few days. First post updated.


#4 · Paul W.
    caveman,

    It would be interesting to see some examples of compressing uncompressed binary data of various kinds, using the hex output option, to see strides and other patterns in how gzip responds (or doesn't) to various regularities.

    If you can vary the row length in your pictures, you can probably make various data structures pop out in interesting ways.

    [EDIT: Doh. I forgot you already said you can output narrow & wide... and that narrow was 48. Might want to offer a few more carefully-chosen options, or just a general n columns option.]

    (E.g., if you output 48 columns, you'll be able to see all the powers of two and powers-of-two times 3, which are the most common strides for architectural reasons. If it's 60 columns, you may be able to see strides like 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, and 60. If it's 77, you'll be able to see 7, 11, 14, 21, 22, etc.)

    I do some similar tricks with a couple of plotters resembling fv, and a few trivial filters. (One very handy one just zeroes the lowest few bits of every byte, so that numerically similar but slightly different bytes suddenly match. That can give you a hint that you're looking at numeric data, and make the strides more evident. Surprisingly, it doesn't make unaligned text-y things like English text or x86 machine code look very different in fv.)
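    (For what it's worth, a filter like that is a couple of lines of Python; a hypothetical sketch that masks off the low 3 bits of every byte, with placeholder file arguments:)
    Code:
    import sys

    # Zero the lowest 3 bits of every byte so nearly-equal values become identical,
    # which makes strides and numeric fields easier for an LZ matcher (and the eye) to pick up.
    data = open(sys.argv[1], "rb").read()
    open(sys.argv[2], "wb").write(bytes(b & 0xF8 for b in data))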

    When I get around to it, I will try using gzthermal in conjunction with those other plots, to see which "obvious" regularities are or aren't exploited by gzip.

    Do you have the option of turning off Huffman coding, so you can compare/contrast to see how much compression you're getting from LZ string-matching and how much is from entropy coding literals or offsets?

    That sort of compare/contrast could be very useful in designing simple preprocessors that make non-Markov regularities more obvious to a back end universal compressor, and actually see how the rubber meets the road or fails to.

    Very interesting.
    Last edited by Paul W.; 28th April 2014 at 20:29. Reason: brainfart

#5 · caveman
    Quote Originally Posted by Paul W.
    It would be interesting to see some examples of compressing uncompressed binary data of various kinds, using the hex output option, to see strides and other patterns in how gzip responds (or doesn't) to various regularities.

    If you can vary the row length in your pictures, you can probably make various data structures pop out in interesting ways.

    [EDIT: Doh. I forgot you already said you can output narrow & wide... and that narrow was 48. Might want to offer a few more carefully-chosen options, or just a general n columns option.]

    (E.g., if you output 48 columns, you'll be able to see all the powers of two and powers-of-two times 3, which are the most common strides for architectural reasons. If it's 60 columns, you may be able to see strides like 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, and 60. If it's 77, you'll be able to see 7, 11, 14, 21, 22, etc.)
    I may add an option to freely set the line width.

    Quote Originally Posted by Paul W.
    I do some similar tricks with a couple of plotters resembling fv, and a few trivial filters. (One very handy one just zeroes the lowest few bits of every byte, so that numerically similar but slightly different bytes suddenly match. That can give you a hint that you're looking at numeric data, and make the strides more evident. Surprisingly, it doesn't make unaligned text-y things like English text or x86 machine code look very different in fv.)

    When I get around to it, I will try using gzthermal in conjunction with those other plots, to see which "obvious" regularities are or aren't exploited by gzip.
    I don't know what you call "fv"; anyway, you are free to feed gzthermal whatever gzipped document you have at hand...

    Quote Originally Posted by Paul W.
    Do you have the option of turning off Huffman coding, so you can compare/contrast to see how much compression you're getting from LZ string-matching and how much is from entropy coding literals or offsets?

    That sort of compare/contrast could be very useful in designing simple preprocessors that make non-Markov regularities more obvious to a back end universal compressor, and actually see how the rubber meets the road or fails to.
    I could eventually hack it to differentiate LZ matches from literals using only two background colors...

#6 · Matt Mahoney
    fv is probably http://mattmahoney.net/dc/fv.zip
    It is a program that inputs a file and produces a file fv.bmp showing the location and distances of matches using different colors to indicate match length.
    It produces images like this of the Calgary corpus: http://mattmahoney.net/dc/dce.html#Section_2

#7 · Paul W.
    Matt, caveman,

    Yes, that's the fv I meant. I just started another thread with an example of using it with a couple of handy filters.

    http://encode.ru/threads/1937-Decons...simple-filters

#8 · Paul W.
    caveman:

    I could eventually hack it to differentiate LZ matches from literals using only two background colors...
    Or I suppose you could change the colors of the characters, using grays vs. white, to encode a different channel of information, e.g., how much you're benefiting from Huffman.

#9 · Paul W.
    caveman,

    Is gzthermal open source? (If you said so, I somehow missed it, sorry.)

    I'm looking for a really easy way to do something much simpler, but visually similar, displaying lines of characters in colored boxes, where the background color is a function of the character code itself. (The visualizer would just take a file, a line length, and a 256-element table saying what color to make the background for each possible byte value. Ideally I could have a table for letter colors too.)

    I'm sure there are lots of ways this would be trivial if I was good at any particular relevant thing (e.g., generating HTML that messes with background colors and letter colors, or making ASCII art photo renderers), but I'm not, so I'm looking for an appropriate library to use or program I can adapt.
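    Something like this bare-bones sketch is roughly what I have in mind (hypothetical code: HTML output, background color picked from a 256-entry table indexed by the byte value, with a toy default table):
    Code:
    import sys
    from html import escape

    # Toy default: a 256-entry background color table (index = byte value).
    def default_table():
        table = []
        for b in range(256):
            if b == 0x00:            table.append("#000000")   # reserve black for 00
            elif b == 0xFF:          table.append("#ffffff")   # and white for FF
            elif 0x20 <= b < 0x7F:   table.append("#4ab03d")   # printable ASCII
            else:                    table.append("#%02x40%02x" % (b, 255 - b))
        return table

    def dump_html(path, width=32, table=None):
        table = table or default_table()
        data = open(path, "rb").read()
        out = ["<pre style='font-family:monospace;color:#ddd'>"]
        for off in range(0, len(data), width):
            cells = "".join(
                "<span style='background:%s'>%s</span>"
                % (table[b], escape(chr(b)) if 0x20 <= b < 0x7F else ".")
                for b in data[off:off + width])
            out.append("%08x  %s" % (off, cells))
        out.append("</pre>")
        return "\n".join(out)

    if __name__ == "__main__":
        print(dump_html(sys.argv[1], width=48))   # pick a stride-friendly line length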

    If you're interested, here's another thread on using fv with simple filters to visualize structure and entropy in weird data:

    http://encode.ru/threads/1943-visual...simple-filters

    A few posts in, I use a couple of hex dumps to show certain patterns that show up in straight hex, and others that show up in the ASCII-interpreted columns. If I could just colorize the hexdump to point up qualitatively different kinds of values (e.g., numerically small binary values, letters that are very common in text like _etaoinshrlcu, common x86 bytecodes, etc.), a lot of different kinds of data would have visually obvious signatures, and you'd be able to see things like strides in certain qualitatively different fields of repeating structures, or where you have 8-bit ASCII vs. 16-bit Unicode text strings embedded in mostly machine code, etc.

    Some examples:

    If you reserve black and white for hex 00 and FF, a lot of binary data will have very distinctive columns or diagonals of black and/or white pixels, indicating that you very probably have binary numbers of certain sizes at certain strides, because their high bits are usually zero, or FF if they're negative. E.g., a 16-bit stereo WAV file will regularly have stretches where every other whole column is white for a while, then black for a while when the waveform crosses zero. Mapping other small binary numbers to a range of blues, with smaller being lighter, would make that work even better.

    I think a single color map could work pretty well for most common kinds of data---texty stuff, binary numbers, machine code, etc.---but you could vary the color map to point up more subtle things in specific kinds of data. (E.g., if you know it's not machine code or text, you can use a map that shows numeric stuff more clearly, e.g. just mapping the whole range of binary values to a heat-map style color spectrum, or roughly that plus distinguishing between odd and even values.)

    I think that'd be a very useful complement to gzthermal and fv, letting you look at the data itself in a useful way, before you interpret it as string matches and literals or string matches and match distances.

#10 · caveman
    Quote Originally Posted by Paul W.
    Is gzthermal open source? (If you said so, I somehow missed it, sorry.)

    I'm looking for a really easy way to do something much simpler, but visually similar, displaying lines of characters in colored boxes, where the background color is a function of the character code itself. (The visualizer would just take a file, a line length, and a 256-element table saying what color to make the background for each possible byte value. Ideally I could have a table for letter colors too.)

    I'm sure there are lots of ways this would be trivial if I was good at any particular relevant thing (e.g., generating HTML that messes with background colors and letter colors, or making ASCII art photo renderers), but I'm not, so I'm looking for an appropriate library to use or program I can adapt.
    It's not open source, but I could eventually modify gzthermal so that it does what you are looking for and hand it over to you (check your private messages).
    The way it works now limits the background colors to 17, and the text has to be white (it uses prerendered semi-transparent charset images, no real text compositing).

#11 · Paul W.
    caveman,

    What representation are your glyphs in? (The prerendered characters on colored backgrounds.)

    If they're in a pnm format, e.g., or some format you can easily interconvert with that, I think it's easy on Linux to convert them to alpha masks with something like ppmcolormask(1) or the -transparent option of pnmtopng(1); just give it a color argument saying that the white bits (the white character parts) should be transparent. Then you can easily composite your transparent-letters-on-colored-backgrounds with any colored rectangle you want for the actual character using pnmcomp(1).

    The easy way to do it without depending on having those libraries on your platform would be to just have b * c * 256 versions of your characters in a big 3D array, where b is the number of background colors, c is the number of character colors, and 256 is the number of data values. If b and c are each 16, that's 16 * 16 * 256 = 64K glyph images, which, if your glyphs are say 12 x 16 pixels each, is about 64K * 192 = roughly 12 megabytes. It's ridiculously fat, but very simple and portable, and for a visualization tool, who cares about 12 megabytes?

    There's presumably an easy way to do this cross-platform in Python with the Wand module, which as I understand it is a simple, basic interface to the ImageMagick library. (There are other, less simple ones, but I'm guessing Wand would do fine.) I assume that would be plenty fast to composite the character glyphs with colors in real time to display a page of legible text, without requiring you to store bunches of copies of the same glyphs with different coloring. Unfortunately I'm not really a Python programmer yet, but if you are, that might be an easy, portable way to go.

#12 · caveman
    Quote Originally Posted by Paul W.
    What representation are your glyphs in? (The prerendered characters on colored backgrounds.)
    Paletted PNG with many transparent colors, but only the heart of the compressed image is in fact stored.
    The glyph image has 16 palette entries (14 partially transparent, 1 fully transparent, 1 bright white). Since only bright white is non-transparent, it is shared among all the "sub-palettes" of the final 256-color palette (1 + 15x17: 1 white, plus 15 transparency levels composited over 17 different backgrounds); the solid colors are computed on the fly. If you want to hack the background colors used by gzthermal, look for 0x00,0x05,0x60, 0x02,0x3D,0x9A, 0x00,0x5F,0xD3, 0x01,0x86,0xC0, 0x4A,0xB0,0x3D, 0xB5,0xD0,0x00, 0xEB,0xD1,0x09, 0xFB,0xA7,0x0F, 0xEE,0x00,0x00, 0xD0,0x00,0x00... in the binary executable; these are the RGB values used to define the colors, the first one being the midnight blue.

    Of course I could produce a truecolor PNG and start with a higher quality font rendering, but the resulting image files are already damn huge and the increased quality would be barely noticeable.

    [Attached image: pal.png]
    Last edited by caveman; 16th May 2014 at 01:24.

#13 · Paul W.

    fv + simple filters + atxd + gzthermal = too cool

    OK, thanks.

    As far as colorized hex dumping per se goes, I think I'm surprisingly happy for the moment with atxd (the little two-pages-of-Python, html-generating, colorizing hex dumper I wrote a few days ago and posted about here). I can use it for visually informative hex dumping at various stride-related line lengths, and fv (plus simple filters) to visualize structure and entropy the fv way, and gzthermal (maybe plus simple filters) to see the same stuff actually get compressed in detail, up close and personal. Too damned cool. Lots to play with, and I can suddenly just see stuff that was hard to even guess at before. (Without munging color codes in binaries. The colormaps weren't going to match anyhow, given that each view wants a different color map to show different things, so I think I'll leave gzthermal alone for now, and just use it.)

    At some point I'd like to glom all this together into a simple compression-oriented interactive data visualization program that lets you select and zoom, and pick various complementary views and filters in coordinated ways, with various corresponding pictures neatly lined up to see the correspondences... probably in Python with NumPy and SciPy (incl. PyPlot), which unfortunately I need to properly learn first...

    ... but these tools and a few filters, as-is, make a nice complementary suite, which I can script or use manually much, much, more efficiently than I was doing before. Neato. Now I can actually see various regularities and their consequences, and some things I've been stupidly speculating about for years. Sweet!

