Somebody asked in another thread what fv is, and I thought I'd put my answer in a separate thread rather than derailing that one...
fv is a spiffy plotting program Matt Mahoney wrote. It goes through a file keeping dictionaries of 1-, 2-, 4-, and 8-byte strings (hashes actually), and plotting the match distances. 1-byte matches show up in black, 2-byte matches show up in red, 4-bytes in green, and 8-bytes in blue. Matt's "Data Compression Explained" online book uses a few pictures generated with fv; so does his page talking about the data in the LTCB test corpus:
There's a link in there to the program. It's free and open source.
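To make the idea of "match distances" concrete, here is a minimal Python sketch of the bookkeeping. It is not fv's code: the name match_distances is made up for illustration, it uses an exact dictionary rather than hashes, and it assumes the distance is measured back to the most recent earlier occurrence of the same n-byte string.

```python
def match_distances(data: bytes, n: int):
    """Yield (position, distance) for every n-byte string seen earlier.

    Assumption: distance is measured back to the most recent earlier
    occurrence. An exact dict is used here instead of a hash table.
    """
    last_seen = {}  # n-byte string -> offset of its latest occurrence
    for i in range(len(data) - n + 1):
        key = data[i:i + n]
        if key in last_seen:
            yield i, i - last_seen[key]
        last_seen[key] = i


if __name__ == "__main__":
    sample = b"abcabc"
    for n in (1, 2):
        print(n, list(match_distances(sample, n)))
```

Running this for n = 1, 2, 4, and 8 and plotting distance against position would give roughly the four layers described above.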
I find fv to be most interesting when applied to non-text data, in conjunction with some simple filters to quantize and deinterleave data.
For example, if you look at an uncompressed audio or image file straight through fv, you can generally see the major strides, including the 4-byte stride of 16-bit stereo WAVs, the pixel and raster strides for RGB images, etc.
Here's a hunk of a Bjork song ("Hunter") as 16-bit stereo:
The plot shows that as we go left to right through the file, we see a lot of one-byte matches (gray) at distances distributed anywhere from 4 to about a thousand bytes, which looks a lot like noise. We also get a lot of two-byte matches at much longer distances, which also looks a lot like noise. This is clearly not text, and LZ77 isn't going to do well here---we have pretty short match lengths at distances long enough to not be worth it. And we're not seeing any noticeable 4-byte matches (green) or 8-byte matches (blue).
Worse, if you look at the distribution of one-byte match distances, we're not seeing the mostly short-range matches that would make Huffman coding bytes work well. If we had the kind of striking frequency skew we want, we'd have noticeable match-distance skew too, because frequently-seen data are usually recently-seen too.
But there's a noticeable stride---the gray line across the plot at 4, and the weaker lines at 8, 12, 16. (Those strides continue up until they get lost in the log plot blur, but they're still there in the gray and pink bands, as a linear plot can show.)
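If you want a number rather than a picture, a crude way to check for a stride is to count, for each distance d, how many bytes equal the byte d positions earlier. This is a different measurement from what fv plots (it counts every match at a fixed distance, not just the nearest one), and lag_match_counts is a hypothetical helper name, but a stride of 4 should show up as peaks at 4, 8, 12, and so on.

```python
def lag_match_counts(data: bytes, max_lag: int):
    """Return counts where counts[d] is the number of positions i
    with data[i] == data[i - d], for d in 1..max_lag (counts[0] is unused)."""
    counts = [0] * (max_lag + 1)
    for d in range(1, max_lag + 1):
        # Pair each byte with the byte d positions before it.
        counts[d] = sum(a == b for a, b in zip(data[d:], data[:-d]))
    return counts


if __name__ == "__main__":
    # Synthetic data with an obvious 4-byte period.
    data = bytes([1, 2, 3, 4] * 8)
    print(lag_match_counts(data, 8))
```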
If a stride is evident like that, you can just deinterleave the data on that stride before running it through fv, and pow, you can see what's happening in the most and least significant bytes of each channel, as separate side-by-side pictures within the fv plot.
That's what you get by skipping through memory looking at every 4th byte, seeing all the four-byte-aligned bytes, then doing it again for the ones that are four-byte-aligned but off by 1, then again by 2 and 3.
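As a rough sketch of that filter in Python (deinterleave is just an illustrative name, not something that ships with fv), slicing with a step does all the work:

```python
def deinterleave(data: bytes, stride: int) -> bytes:
    """Regroup data so that all bytes at offset 0 mod stride come first,
    then all bytes at offset 1 mod stride, and so on."""
    return b"".join(data[k::stride] for k in range(stride))


if __name__ == "__main__":
    print(list(deinterleave(bytes(range(8)), 4)))
```

Writing the result to a file and feeding that to fv gives the side-by-side picture, one panel per byte offset within the stride.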
In that picture we can see that the even bytes still look like noise, but the odd ones don't at all. (That's typical of 16-bit data representing an approximation of a more or less continuous function, whether it's specifically audio or not.)
Or you can quantize the data without deinterleaving---just blindly zero out the lowest few bits of every byte---and various strides become much more obvious in the interleaved data:
(Here I've zeroed the lower four bits of each byte and left the other four intact.)
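A quantizing filter like that is only a few lines. This is a sketch with a made-up name (quantize) and a made-up parameter (low_bits), assuming you simply mask every byte the same way regardless of what field it belongs to:

```python
def quantize(data: bytes, low_bits: int = 4) -> bytes:
    """Zero the lowest low_bits bits of every byte, keeping the rest."""
    mask = (0xFF << low_bits) & 0xFF
    # Build a 256-entry lookup table once, then map every byte through it.
    table = bytes(b & mask for b in range(256))
    return data.translate(table)


if __name__ == "__main__":
    print(quantize(bytes([0x12, 0xAB, 0xFF, 0x0F])).hex())
```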
For an audio file, quantizing like that often brings out correlation between channels, which would be lost in a deinterleaved plot. The plain fv plot will have a strong stripe at, say, 4 for 16-bit stereo audio, because within each channel, the most significant bytes are usually the same as for the previous sample. But if you quantize, you start to see more matches across the channels, with weaker stripes at 2, 6, etc. too.
(I'd guess that's partly because (1) except when the music is loudest, the sample values tend to be closer to the center of their range than the extremes, and partly because (2) the overall amplitude is usually dominated by low frequencies, and bassy sources are usually not placed far off to the side of the stereo field.)
Less obviously, you can also do simple tricks that often make an unknown stride obvious, like blindly deinterleaving the data at a stride of 48, and seeing what happens. If there's a stride on a factor of 48 (e.g., 12 or 16) it will usually show up in the 48-way deinterleaved plot in a way that gives you a big hint what the real stride is. (Conversely, if you deinterleave on a small stride that's a factor of an actual, larger stride, that will generally show up visually in a different way.)
You can essentially use fv with a couple of filters, varying the parameters, to "tune in" to the regularities in structured data, sort of like messing around with an oscilloscope to find the right fundamental frequency. Once you've got that, you can mess around with things like quantization to see what happens to various fields when you lop off a few low bits, or a few high bits, to get hints as to what kind of data is at each offset from the stride. (It's also useful to have filters that respect alignment, chopping things up into nonoverlapping chunks.)
Here's Bjork with both quantization (zeroing the least significant halves of bytes) and deinterleaving:
Here you can see that the most significant halves of the high bytes match a lot better---they very often match the preceding sample's upper 4 bits exactly. The same isn't true of the high bits of the least significant bytes---even if we lop the four lowest bits off, it still looks like noise.
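To put a rough number on "match a lot better", you can measure, for each byte offset within the stride, how often the quantized byte equals the quantized byte one stride earlier. This is a sketch of my own rather than anything fv reports; lane_repeat_rates, stride, and mask are hypothetical names, and the test data in the example is synthetic, not the Bjork file.

```python
def lane_repeat_rates(data: bytes, stride: int = 4, mask: int = 0xF0):
    """For each byte offset k in 0..stride-1, return the fraction of
    masked bytes that equal the masked byte one stride earlier."""
    rates = []
    for k in range(stride):
        lane = [b & mask for b in data[k::stride]]
        pairs = len(lane) - 1
        same = sum(a == b for a, b in zip(lane[1:], lane[:-1]))
        rates.append(same / pairs if pairs > 0 else 0.0)
    return rates


if __name__ == "__main__":
    # Synthetic 4-byte frames: jumpy low bytes, constant high bytes.
    frames = b"".join(
        bytes([(i * 37) % 256, 0x12, (i * 91) % 256, 0x34]) for i in range(16)
    )
    print(lane_repeat_rates(frames))
```

On real 16-bit little-endian stereo audio I'd expect the rates for offsets 1 and 3 (the high bytes) to come out much higher than for offsets 0 and 2, in line with the plot.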