I've been experimenting with some tools for visualizing structure & entropy in arbitrary (uncompressed) files. (See the "Deconstructing Bjork" thread for examples with audio and images.)
It's pretty easy to recognize a lot of common kinds of data---text, variable-length instruction code, uncompressed 8- or 16-bit integer grayscale or RGB data, integer audio, etc.---and to pick out the major strides and whatnot. (Audio channel strides in WAVs, pixel and raster strides in images, etc.)
It gets more interesting when you're looking at application- or program-specific data formats that involve interleaved batches of mixed kinds of data, floating point numbers, etc.
So I thought I'd look at geo from the Calgary corpus, which is seismic data in an obscure obsolete data format, using nonstandard floating point and some pretty complicated rounding and truncation rules.
(This is mostly an exercise in visualization for compression... I can't say I know how to compress geo well, at least not yet.)
If you just run geo through Matt Mahoney's fv program, which plots string matches of four lengths (1, 2, 4, and 8) by their distances, it looks like this:
The match distance is on the vertical (base-10) log scale---each major tick represents a factor of 10. Near the bottom we see individual match distances from 1 to 9; above that, bins of 10 between the minor ticks (10-19, 20-29 ... 90-99); then bins of 100 (100-199, 200-299 ... 900-999); then bins of 1000; and finally bins of 10,000 at the top.
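If it helps, here's my reading of that binning as a throwaway Python helper (my own sketch of the idea, not fv's actual code):

```python
def distance_bin(d):
    """Map a match distance d >= 1 to (decade, bin-within-decade):
    distances 1-9 get their own bins in decade 0; 10-99 fall into
    bins of 10 in decade 1 (10-19 -> bin 1, ..., 90-99 -> bin 9);
    100-999 into bins of 100 in decade 2; and so on."""
    decade = len(str(d)) - 1          # floor(log10(d)) for d >= 1
    return decade, d // 10 ** decade
```

So a match at distance 42 lands in decade 1, bin 4 (the 40-49 bin), a match at 250 in decade 2, bin 2, etc.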
The color shading represents the match lengths. If you have matches of length 1, they contribute blackness to the pixel range for their match distance bin, matches of length 2 contribute redness, 4-byte matches contribute green, and 8-byte matches contribute blue.
In the plot above, we can see a whole lot of blackness at 4, meaning we're getting a lot of repeated bytes four bytes apart, pretty much throughout the plot. We also get longer matches at much longer distances that are multiples of four, suggesting that our data is basically 4-byte data (which it is).
The shift from blackish to reddish to greenish stripes as we go up in distance is normal. All other things being equal, you expect more 1-byte matches at short distances, 2-byte matches at longer distances (because they're more specific), and so on.
Unfortunately, the broadness of the reddish and greenish bands suggests that the data are pretty random---we're not getting noticeably more matches to longer strings than we'd expect by chance, and normal text-oriented compression algorithms aren't going to do well for this data. We need a better model in the compressor, or in a preprocessor.
Another thing to notice in this basic plot: There are periodic dark places in the bottom stripe, at distance 1. That suggests that the data are punctuated by headers or something with very different entropic properties---namely, runs of repeated bytes. (And that's true. We're seeing 200-byte headers that are very runny, and very unlike the 4-byte float data that make up the batches of the file. Within those headers, we also see some matches to strings from the previous headers---at heights corresponding to the two batch sizes.)
By the spacing of those dark spots, we can see the sizes of the batches of float data in between, and a pattern in the batch sizes. We see a repeated pattern of two big batches (something over 1300 bytes) and then a smaller batch (something over 600), over and over, with 20-odd batches total.
We're not seeing any interesting structure in those batches---the statistics we're looking at don't change much across batch boundaries, so we must be seeing about as much matching to data in previous batches, near the beginning of a batch, as to data within a batch, toward the end of that batch. That too looks pretty random. Bummer.
Now I'll (de)interleave the data on a stride of 4 bytes, reordering the file so that every fourth byte is shown, then every four bytes offset by 1, then offset by 2, then offset by 3. That puts all the first-bytes-of-four together at the beginning of the picture, in the same order they occur in the file, and likewise for the ones following those, etc.:
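The (de)interleave step is simple to sketch in Python (this is a few-line version of the idea, not the exact filter I used):

```python
def deinterleave(data: bytes, stride: int = 4) -> bytes:
    """Group every stride-th byte together: all bytes at offset 0,
    then all bytes at offset 1, and so on, preserving file order
    within each group (lane)."""
    return b"".join(data[off::stride] for off in range(stride))
```

Running the file through this before fv is what produces the four-quarter pictures.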
OK, now we're getting somewhere. The first bytes are very different from the 2nd bytes or 3rd bytes, which look pretty much like noise. The first bytes have a lot less entropy, and some visible patterning. The horizontal greenish and bluish blotchy bars say that there's some larger-scale stringishness and stridiness going on. (That's more obvious in the thumbnail than in the full-size picture.)
The fourth bytes of groups of four have even less entropy---almost none, and the black bar at 1 says we're just seeing runs of repeated bytes. (Except for the high-up blips where we see headers between the batches of data.) If you know the geo data format, or take a peek at the file with hexdump, you know why: they're all zero. Our four-byte float data are really three byte floats padded out to four with zeroes.
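That's easy to confirm in a few lines (a sketch; the 200-byte headers will contribute some nonzero bytes, so on the real file you'd expect a fraction near, not exactly, 1.0):

```python
def lane_zero_fraction(data: bytes, lane: int, stride: int = 4) -> float:
    """Fraction of the bytes in one interleaved lane that are zero."""
    lane_bytes = data[lane::stride]
    return lane_bytes.count(0) / len(lane_bytes)

# e.g. lane_zero_fraction(open("geo", "rb").read(), lane=3) should come
# out near 1.0 if the fourth bytes really are padding zeroes
```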
The big difference in entropy between first bytes and second bytes is typical of either floats or 16-bit integer data, as in stereo audio or 16-bit color samples for 48-bit RGB. We could be seeing the high and low bytes of channels. For audio or pictures, we usually wouldn't see one low-entropy channel and two high-entropy noise-like channels, though. We'd more likely see pairs, for high and low bits of left and right channels, or triples of pairs for 16-bit color. (Maybe padded out to quadruples with zeroes, or a low-entropy alpha or Z channel.)
At this point, if we didn't already know, it'd be a good guess that we're looking at 32-bit float data that for some reason have been truncated or rounded to remove insignificant low bits in the mantissas. (E.g., limited sensor precision, filtering out noise, making it clear there are no significant digits there, etc.)
The low-entropy "channel" on the left is likely the exponents, which often don't vary nearly as much, for any or all of several reasons. (Limited range of the data being modeled, e.g. 0 to 1 for probabilities or 0-100 for percentages, or slowly changing values if what's being modeled is a discretized curve or surface, etc.)
Floats are tricky, though, because float formats may not line up with bytes. You might see a small exponent plus the high bits of a mantissa in the first byte, or most of the exponent in the first byte and the rest of it and the start of the mantissa in the next, or whatever, depending on the size and representation of your floats. (And when you're looking at data in a foreign endianness, things can get pretty messed up.)
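For a concrete example of exponent bits straddling bytes, take ordinary IEEE 754 singles (not geo's format---geo's is byte-aligned, IBM-style): the first byte holds the sign bit plus the top seven of the eight exponent bits, and the last exponent bit spills into the second byte.

```python
import struct

def single_bytes(x: float) -> str:
    """Big-endian IEEE 754 single as hex. Byte 0 is the sign bit plus
    the high 7 bits of the 8-bit exponent; byte 1 starts with the low
    exponent bit, followed by the top 7 mantissa bits."""
    return struct.pack(">f", x).hex()

# 1.0 has exponent 127 (binary 0111 1111), so the exponent straddles
# bytes 0 and 1: single_bytes(1.0) gives "3f800000"
```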
I'll ignore that, because we happen to know that the data are actually byte-aligned, with one byte that's the exponent and two that are the mantissa. I am only guessing that it's the first (0th) bytes that are the exponents... I haven't checked.
Since this is a lot of data, and (de)interleaving compresses the visual scale of a particular part you're interested in, I'm going to zoom in a bit, by a factor of 4, and look at one quarter of the file from near the middle. When you zoom, things get grittier-looking:
This is about two cycles of the repeating long batch / long batch / short batch cycle, though the window doesn't line up exactly with the cycle boundaries. That's enough to see most patterns that recur at that interval.
That zoom is helpful for the next picture, where I'll (de)interleave this smaller chunk of data at a stride of four bytes before running it through fv:
Now it's easier to see things going on in each quarter of the picture.
But we've lost something, too... where'd the stringiness and stridiness in the left quarter go? Where are the horizontal greenish and bluish blotchy bars? It looks a lot more like blotches, and less like blotchy bars. (Partly because the first blotch doesn't look right, because it doesn't have a previous blotch to match against in this short data set as it does in the full file.)
We can clarify what's going on, a lot. When you have stringy-stridey repetitiveness in the high bytes of ints or floats, it's often strongest in the very highest bits---e.g., the top 4 of 8 bits. For now, we'll just toss out the low bits, running the data through a filter that quantizes every byte, shifting out the low four bits so only the high four bits remain, and redo the fv picture:
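The quantizing filter is just a shift (again a sketch of what I mean, not the exact filter):

```python
def quantize_high_nybbles(data: bytes) -> bytes:
    """Shift out the low four bits of every byte, leaving the high
    nybble as a value in 0..15 (shifted down, not masked in place)."""
    return bytes(b >> 4 for b in data)
```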
OK, that's interesting. Now we can clearly see that the blotches in the first bytes have a particular approximate shape, and repeat. There's something regular going on, and those peculiar blotches are what made up the blotchy bars in the picture for the whole file. We only see about three repeats here, but they go on through the whole data set.
Of course, this kind of picture doesn't tell you how much compression you can actually get---we're blithely throwing away half the bits of every byte, to bring out structure and patterns in the data. That's mainly useful for figuring out roughly what's going on in the data, in hopes of appropriately modeling the real regularities in those bytes, and the other bytes too.
Now I'm going to apply a new filter, which is a lossy delta coder, though I'm going to use it losslessly---it's lossless within the range that we need. I use a lossy coder so that I can do things bytewise for input into fv, or other filters before fv.
For each byte, the delta coder just outputs the difference from the previous byte as a byte. If the difference is greater than 127 or less than -128, it wraps around mod 256 and gives you a number in that range.
For what we're doing here, it never wraps, because our values after quantization are actually 4-bit values stored in bytes, so their differences can never be more than 15 or less than -15---well inside the lossless range.
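The delta coder is a couple of lines (a sketch of the idea, not the exact filter):

```python
def delta_code(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte,
    wrapped mod 256 into a byte (so deltas outside -128..127 alias).
    For 4-bit inputs the deltas stay in -15..15 and never wrap."""
    out = bytearray(len(data))
    prev = 0
    for i, b in enumerate(data):
        out[i] = (b - prev) & 0xFF
        prev = b
    return bytes(out)
```

So a falling step of 2 comes out as 254 (i.e. -2 mod 256), which is still uniquely decodable as long as the true deltas stay in range.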
(I switched quantizers to make this work; in the earlier thread about analyzing audio and images, I used a filter that zeroed out the low bits of bytes and left the high bits in place. With a lossy delta coder, it's better to shift than mask, so you put the high bits in the lossless range for the coder.)
Notice that we're just replacing bytes (nybbles really) with their deltas from their immediately preceding bytes/nybbles; we're not doing actual compression in the filter. Running the result through fv does do similar modeling to an LZ-style compressor there, so it often shows opportunities for compression.
So here's the picture with delta coding of high nybbles and fv's string-matching:
Notice the colored horizontal bars in the lower left of the picture, which are not there without delta coding. Literal string repeats usually show up after delta coding, a bit weaker. (Two copies of the same literal string will mostly match after delta coding---all but the first byte.)
There seems to be something very regular going on, with repeated patterns of deltas in at least the high nybbles of the first bytes. Notice in particular the blue bar at 9---that means we're seeing a fair number of repeats of 8-byte strings of deltas at an interval of 9 bytes after deinterleaving at 4, so at a stride of 36 bytes in the actual data.
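To double-check the arithmetic that a distance of 9 within a deinterleaved lane means a 36-byte stride in the file, here's a synthetic test (made-up marker data, not geo):

```python
# Plant a marker every 36 bytes in synthetic raw data, deinterleave at
# stride 4, and see where the markers land within lane 0. Since 36 is
# a multiple of 4, every marker falls in the same lane.
raw = bytearray(360)
for pos in range(0, len(raw), 36):
    raw[pos] = 0xAB
lane0 = raw[0::4]                      # the first-bytes-of-four lane
marks = [i for i, b in enumerate(lane0) if b == 0xAB]
# markers sit 9 apart within the lane: 36-byte file stride / 4 lanes
```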
(Or so I infer; maybe I'm misinterpreting something, or possibly have a bug that shows up in this weird way...)
Does anybody know if there's a known stride of 36 (or any multiple of 3) in the geo data set? I don't recall any mention of that in the published descriptions I've seen. I'm wondering (assuming it's real) if people are modeling it, maybe without knowing it---e.g., because of LZMA's trick of using low bits of offsets as context to implicitly deal with strides, heuristically. (Though my impression is that LZMA isn't very good at large strides with odd factors.)
There's more going on in the pictures I've looked at, including some weak regularities in the second bytes and third bytes of fours, but I thought this was the most interesting so far, and I'll stop here for now.
Feedback on this would be very welcome.