
Thread: visualizing structure & entropy in geo with fv & simple filters

  #1: Paul W. (Member)

    visualizing structure & entropy in geo with fv & simple filters

    I've been experimenting with some tools for visualizing structure & entropy in arbitrary (uncompressed) files. (See the "Deconstructing Bjork" thread for examples with audio and images.)

    It's pretty easy to recognize a lot of common kinds of data---text, variable-length instruction code, uncompressed 8- or 16-bit integer grayscale or RGB data, integer audio, etc.---and to pick out the major strides and whatnot. (Audio channel strides in WAVs, pixel and raster strides in images, etc.)

    It gets more interesting when you're looking at application- or program-specific data formats that involve interleaved batches of mixed kinds of data, floating point numbers, etc.

    So I thought I'd look at geo from the Calgary corpus, which is seismic data in an obscure obsolete data format, using nonstandard floating point and some pretty complicated rounding and truncation rules.

    (This is mostly an exercise in visualization for compression... I can't say I know how to compress geo well, at least not yet.)

    If you just run geo through Matt Mahoney's fv program, which plots string matches of four lengths (1, 2, 4, and 8) by their distances, it looks like this:

    [Image: geo_fv.png]

    Match distance is on the vertical (base-10) log scale---each major tick represents a factor of 10. Near the bottom we see individual match distances from 1 to 9; in the next band up, bins of 10 (10-19, 20-29, ... 90-99); above that, bins of 100 (100-199, 200-299, ... 900-999); then bins of 1,000; and finally bins of 10,000 at the top.

    The color shading represents the match lengths. If you have matches of length 1, they contribute blackness to the pixel range for their match distance bin, matches of length 2 contribute redness, 4-byte matches contribute green, and 8-byte matches contribute blue.
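
    (As I understand it, what fv computes for each position in the file is the distance back to the most recent occurrence of the 1-, 2-, 4-, and 8-byte string ending there, and it plots position against log distance in that length's color. Here's a rough sketch of that statistic, using one hash table of last-seen positions per length; this is just my reading of it, not Matt's actual code.)

    Code:
    // Rough sketch of the statistic fv plots (my reading of it, not Matt's code):
    // for each position, the distance back to the previous occurrence of the
    // 1-, 2-, 4-, and 8-byte string ending there, found via a hash table of
    // last-seen positions per length.  Hash collisions can give false matches;
    // this is only meant to show the idea, not to reproduce fv's pictures.
    #include <stdio.h>
    #include <string.h>
    
    #define TABSIZE (1<<20)
    static long last[4][TABSIZE];        // last position seen for each hash, per length
    static unsigned char buf[1<<24];     // assume the whole file fits
    
    int main(int argc, char** argv) {
      const int LENS[4]={1,2,4,8};
      FILE* in=fopen(argc>1?argv[1]:"geo", "rb");
      if (!in) return 1;
      long n=fread(buf, 1, sizeof(buf), in);
      memset(last, -1, sizeof(last));
      for (long pos=8; pos<=n; ++pos) {
        for (int k=0; k<4; ++k) {
          unsigned h=0;
          for (long i=pos-LENS[k]; i<pos; ++i) h=h*101+buf[i];
          h%=TABSIZE;
          if (last[k][h]>=0)             // fv would shade x=pos, y=log10(distance)
            printf("%ld %d %ld\n", pos, LENS[k], pos-last[k][h]);
          last[k][h]=pos;
        }
      }
      return 0;
    }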

    In the plot above, we can see a whole lot of blackness at 4, meaning we're getting a lot of repeated bytes four bytes apart, throughout most of the plot. We also get longer matches at much longer distances that are multiples of four, suggesting that our data is basically 4-byte data (which it is).

    The shift from blackish to reddish to greenish stripes as we go up in distance is normal. All other things being equal, you expect more 1-byte matches at short distances, 2-byte matches at longer distances (because they're more specific), and so on.

    Unfortunately, the broadness of the reddish and greenish bands suggests that the data are pretty random---we're not getting noticeably more matches to longer strings than we'd expect by chance, and normal text-oriented compression algorithms aren't going to do well for this data. We need a better model in the compressor, or in a preprocessor.

    Another thing to notice in this basic plot: There are periodic dark places in the bottom stripe, at distance 1. That suggests that the data are punctuated by headers or something with very different entropic properties---namely, runs of repeated bytes. (And that's true. We're seeing 200-byte headers that are very runny, and very unlike the 4-byte float data that make up the batches of the file. Within those headers, we also see some matches to strings from the previous headers---at heights corresponding to the two batch sizes.)

    By the spacing of those dark spots, we can see the sizes of the batches of float data in between, and a pattern in the batch sizes. We see a repeated pattern of two big batches (something over 1300 bytes) and then a smaller batch (something over 600), over and over, with 20-odd batches total.

    We're not seeing any interesting structure in those batches---the statistics we're looking at don't change much across batch boundaries, so we must be seeing about as much matching to data in previous batches, near the beginning of a batch, as to data within a batch, toward the end of that batch. That too looks pretty random. Bummer.

    Now I'll (de)interleave the data on a stride of 4 bytes, reordering the file so that every fourth byte comes first, then every fourth byte starting at offset 1, then offset 2, then offset 3. That puts all the first-bytes-of-four together at the beginning of the picture, in the same order they occur in the file, and likewise for the second, third, and fourth bytes.
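
    The (de)interleave filter itself is trivial; here's a minimal sketch (not my exact code), reading stdin and writing stdout:

    Code:
    // Minimal sketch of a stride-N (de)interleaver: emit all bytes at offsets
    // 0 mod N first, then 1 mod N, and so on.  Assumes the input fits in memory.
    #include <stdio.h>
    #include <stdlib.h>
    
    static unsigned char buf[1<<24];
    
    int main(int argc, char** argv) {
      int stride = argc>1 ? atoi(argv[1]) : 4;
      if (stride<1) stride=4;
      long n = fread(buf, 1, sizeof(buf), stdin);
      for (int phase=0; phase<stride; ++phase)
        for (long i=phase; i<n; i+=stride)
          putchar(buf[i]);
      return 0;
    }
    And here's the fv picture of the (de)interleaved file: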

    [Image: geo_x4_fv.png]

    OK, now we're getting somewhere. The first bytes are very different from the 2nd and 3rd bytes, which look pretty much like noise. They have a lot less entropy, and some visible patterning. The horizontal greenish and bluish blotchy bars say that there's some larger-scale stringiness and stridiness going on. (That's more obvious in the thumbnail than in the full-size picture.)

    The fourth bytes of groups of four have even less entropy---almost none---and the black bar at 1 says we're just seeing runs of repeated bytes. (Except for the high-up blips where we see headers between the batches of data.) If you know the geo data format, or take a peek at the file with hexdump, you know why: they're all zero. Our four-byte float data are really three-byte floats padded out to four with zeroes.

    The big difference in entropy between first bytes and second bytes is typical of either floats or 16-bit integer data, as in stereo audio or 16-bit color samples for 48-bit RGB. We could be seeing the high and low bytes of channels. For audio or pictures, though, we usually wouldn't see one low-entropy channel and two high-entropy noise-like channels. We'd more likely see pairs, for the high and low bytes of left and right channels, or triples of pairs for 16-bit color. (Maybe padded out to quadruples with zeroes, or a low-entropy alpha or Z channel.)

    At this point, if we didn't already know, it'd be a good guess that we're looking at 32-bit float data that for some reason have been truncated or rounded to remove insignificant low bits in the mantissas (e.g., limited sensor precision, filtering out noise, making it clear there are no significant digits there, etc.).

    The low-entropy "channel" on the left is likely the exponents, which often don't vary nearly as much, for any or all of several reasons. (Limited range of the data being modeled, e.g. 0 to 1 for probabilities or 0-100 for percentages, or slowly changing values if what's being modeled is a discretized curve or surface, etc.)

    Floats are tricky, though, because float formats may not line up with bytes. You might see a small exponent plus the high bits of a mantissa in the first byte, or most of the exponent in the first byte and the rest of it and the start of the mantissa in the next, or whatever, depending on the size and representation of your floats. (And when you're looking at data in a foreign endianness, things can get pretty messed up.)

    I'll ignore that, because we happen to know that the data are actually byte-aligned, with one byte that's the exponent and two that are the mantissa. I am only guessing that it's the first (0th) bytes that are the exponents... I haven't checked.

    Since this is a lot of data, and (de)interleaving compresses the visual scale of a particular part you're interested in, I'm going to zoom in a bit, by a factor of 4, and look at one quarter of the file from near the middle. When you zoom, things get grittier-looking:

    [Image: geoqtr_fv.png]

    This is about two repeats of the long batch / long batch / short batch cycle, though it's a bit off from exact alignment. That's enough to see most patterns that recur at that interval.

    That zoom is helpful for the next picture, where I'll (de)interleave this smaller chunk of data at a stride of four bytes before running it through fv:

    [Image: geoqtr_x4_fv.png]

    Now it's easier to see things going on in each quarter of the picture.

    But we've lost something, too... where'd the stringiness and stridiness in the left quarter go? Where are the horizontal greenish and bluish blotchy bars? It looks a lot more like blotches, and less like blotchy bars. (Partly because the first blotch doesn't look right: in this short chunk it has no previous blotch to match against, as it does in the full file.)

    We can clarify what's going on, a lot. When you have stringy-stridey repetitiveness in the high bytes of ints or floats, it's often strongest in the very highest bits---e.g., the top 4 of 8 bits. For now, we'll just toss out the low bits, running the data through a filter that quantizes every byte by shifting out the low four bits and keeping just the high four, and redo the fv picture:
    [Image: geoqtr_x4_sr4_fv.png]

    OK, that's interesting. Now we can clearly see that the blotches in the first bytes have a particular approximate shape, and repeat. There's something regular going on, and those peculiar blotches are what made up the blotchy bars in the picture for the whole file. We only see about three repeats here, but they go on through the whole data set.

    Of course, this kind of picture doesn't tell you how much compression you can actually get---we're blithely throwing away half the bits of every byte, to bring out structure and patterns in the data. That's mainly useful for figuring out roughly what's going on in the data, in hopes of appropriately modeling the real regularities in those bytes, and the other bytes too.
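
    (For reference, the quantizing filter is nothing fancy; a minimal sketch of the byte filter, not my exact code:)

    Code:
    // Minimal sketch of the quantizer: keep only the high 4 bits of each byte,
    // shifted down so output values are 0..15 (shift rather than mask, for the
    // sake of the delta filter described next).
    #include <stdio.h>
    
    int main() {
      int c;
      while ((c=getchar())!=EOF)
        putchar(c>>4);
      return 0;
    }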

    Now I'm going to apply a new filter, which is a lossy delta coder, though I'm going to use it losslessly---it's lossless within the range that we need. I use a lossy coder so that I can do things bytewise for input into fv, or other filters before fv.

    For each byte, the delta coder just outputs the difference from the previous byte as a byte. If the difference is greater than 127 or less than -128, it wraps around (mod 256) and gives you a number in that range.

    For what we're doing here, it never wraps, because our values after quantization are really 4-bit values stored in bytes, so their differences always fall between -15 and 15.

    (I switched quantizers to make this work; in the earlier thread about analyzing audio and images, I used a filter that zeroed out the low bits of bytes and left the high bits in place. With a lossy delta coder, it's better to shift than mask, so you put the high bits in the lossless range for the coder.)

    Notice that we're just replacing bytes (nybbles, really) with their deltas from the immediately preceding bytes/nybbles; we're not doing any actual compression in the filter. Running the result through fv does modeling similar to an LZ-style compressor, though, so it often shows up opportunities for compression.
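
    (Again for reference, the delta filter is about as simple as filters get; a minimal sketch:)

    Code:
    // Minimal sketch of the bytewise delta filter: each output byte is the
    // difference from the previous input byte, mod 256.  Lossy in general,
    // but lossless here because the quantized inputs are 0..15, so the
    // differences stay within -15..15.
    #include <stdio.h>
    
    int main() {
      int c, prev=0;
      while ((c=getchar())!=EOF) {
        putchar((c-prev)&255);
        prev=c;
      }
      return 0;
    }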

    So here's the picture with delta coding of high nybbles and fv's string-matching:

    [Image: geoqtr_x4_sr4_dlt_fv.png]

    Notice the colored horizontal bars in the lower left of the picture, which are not there without delta coding. Literal string repeats usually show up after delta coding, a bit weaker. (Two copies of the same literal string will mostly match after delta coding---all but the first byte.)

    There seems to be something very regular going on, with repeated patterns of deltas in at least the high nybbles of the first bytes. Notice in particular the blue bar at 9---that means we're seeing a fair number of repeats of 8-byte strings of deltas at an interval of 9 bytes after deinterleaving at 4, so at a stride of 36 bytes in the actual data.

    (Or so I infer; maybe I'm misinterpreting something, or possibly have a bug that shows up in this weird way...)

    Does anybody know if there's a known stride of 36 (or any multiple of 3) in the geo data set? I don't recall any mention of that in the published descriptions I've seen. I'm wondering (assuming it's real) if people are modeling it, maybe without knowing it, e.g., because of LZMA's trick of using low bits of offsets as context to deal with strides implicitly and heuristically. (Though my impression is that LZMA isn't very good at large strides with odd factors.)

    There's more going on in the pictures I've looked at, including some weak regularities in the second bytes and third bytes of fours, but I thought this was the most interesting so far, and I'll stop here for now.

    Feedback on this would be very welcome.
    Last edited by Paul W.; 6th May 2014 at 18:34. Reason: fix timed-out picture includes

  #2: Paul W. (Member)
    I had a token expire while editing that and it seems to have botched up some of my picture includes... I'll try to fix it.

    [EDIT... oh great, that just borked different ones... trying again, doing them all over...]

    [EDIT 2: Looks okay to me now. If you can't see the pictures, let me know.]
    Last edited by Paul W.; 6th May 2014 at 18:55.

  #3: Paul W. (Member)
    I was going to post something about some REALLY INTERESTING patterns in geo, but they were so darned interestingly weird I began to doubt my sanity, and suspect bugs in my code (or possibly fv), and I spent a few hours sanity-checking this and that in various ways, including poring over hexdumps. What fun.

    Now I've begun to suspect the bug is in descriptions I've seen of the format of geo, which has been described as floating point data with an unusual one-byte base-16 exponent field represented as one sign bit and 7 unsigned exponent bits.

    I'm beginning to suspect it's not really that, and that the exponent field may actually represent something smaller than a byte, expanded out to a byte (maybe to expand binary to something 7-bit ASCII-transmissible?), creating artifacts you'd never see with any sane representation of floats I've ever heard of. Or I could be seeing something really weird in the actual data, which happen to have ranges and/or strides that interact oddly with the format weirdness, or maybe I have a subtle bug I still haven't nailed yet. I don't really know right now.

    Here's one thing I'm seeing: if I'm correct in thinking that the first byte of each aligned group of four bytes is the exponent, then the high nybbles AND the low nybbles of the exponent each tend to have strong regularities in ways you wouldn't normally expect from a single 8-bit integer field (either twos-complement or 1 sign plus 7 normal bits).

    For anybody who's savvy to all that, and interested, here's the regularity I was seeing, in more detail: If you look at the high 4 bits of the exponents through a delta filter, you see some longish string repeats at a stride of 36, i.e., the sequence of differences from the preceding high-4-bits-of-exponents fairly often repeats after exactly 9 4-byte floats, often matching 8 or more consecutive deltas. Weird, but maybe there's a stride at 9 floats in the actual data, which shows up in the high bits of the exponents, and that's that. But wait...

    What's really weird is that if you look at the deltas of the low nybbles of the very same exponent fields of the very same floats, you see a stride of 32, not 36, as though the deltas are disproportionately repeating every 8 floats, not 9.

    WTF? I have no clear idea right now. I might have a subtle bug somewhere that I still haven't caught, or there might really be a stride of a multiple of 8 and 9, or halfway in between such multiples, so that it somehow shows up as one with some filters, or the other with other filters. Or something else, I just dunno.

    One thing that is clear from hex dumps is that much of the geo data does have a strange regularity, much if not all of the time, where both the low and high nybbles of the bytes aligned on 4-byte boundaries really do have restricted ranges of values---the low nybbles do not vary nearly as much as they should if they were just the low bits of 8-bit numbers. They tend to take only a few values (noticeably fewer than 16) over considerable stretches, and are fairly runny.

    Given that, I'm going to write geo off for now. I just don't know what to do with it. It's too weird and obscure to bother with, except maybe as an exercise---or to beat somebody on the Calgary Corpus by a fraction of a percent---and I'm not sure what the exercise is representative of, short of getting totally geeky about detectable patterns in oddball data.

    If people want to toss me samples of similar data in a more modern and hopefully less bizarre format, I might try finding structure and redundancy in those.

    (I'm not much interested in "puzzle" data, like the output of PRNG's. I'm interested in real data that you'd have for real applications, which might be structured and heterogeneous, involving batches of different kinds of data, structures of heterogeneous data, and repetitions of similar batches or structures. You know, real data like real apps use, just not already compressed...)

  #4: Paul W. (Member)
    BTW, despite my immediately previous post, I think the earlier posts are mostly right. Whether or not the geo format in particular is obscure in the specific subtle ways I mentioned, it usually works to use fv and simple filters to (1) find the basic small-scale stride and (2) identify the high-entropy and low-entropy offsets from that stride, then (3) start narrowing down subtler regularities.

    If you're looking at arrays and/or structs of something like bytes, shorts, ints, floats, doubles, and strings, you can usually sort it out with a little thought and experimentation, with fv and simple filters. (And the occasional peek at the actual data with a hex dump, to see the literals, if necessary.)

  #5: Matt Mahoney (Expert)
    I made a graph of geo and did some analysis at http://mattmahoney.net/dc/dce.html#Section_2
    You may be seeing oscillations in the sensor data with a period of 8 or 9 samples.

  #6: Paul W. (Member)
    Thanks, Matt.

    How did you make the plot? Is the overall downward slope of each of the horizontal-ish lines an artifact of the plotting process, or do all the data in a batch have a steady downward trend, or what? Are the data plotted in batches, with the long lines representing the 1344-byte samples and the short ones the 320-sample batches?

    (Your plot looks sort of like a strip chart where channels are plotted closer together than their ranges, and can cross out of their space into their neighbors' spaces. But this is single-channel data, right?)

    The thing that makes me suspect something's really weird with the data is that the first bytes tend to have high-nybble values of either C or 4, and low-nybble values of 1, 2, or 3, and while the data are pretty runny, they tend to switch suddenly from values starting with C to values starting with 4 pretty frequently.

    If my mental model of base 16 floats is correct, but likely it isn't, I wouldn't expect that... I'd expect exponents for a discretized continuous function to usually vary by 0 or 1, so that the consecutive high nybbles would usually be identical and sometimes (rarely) change by 1, when the normalization changes. When the exponent changes by 1, the mantissa would change dramatically, being shifted by 4. (Giving you a much jaggier precision distribution for the mantissas than with base 2 floats, where the exponents would change by 1 more often, and the mantissas would shift more often, but only shift by 1.)

    The frequent changes of exponent bytes from 42 (or 41 or 43) all the way to C2 (or C1 or C3) have me stumped. In decimal, that's going from 66 to 194. It wouldn't be weird to see jumps like that in the mantissas, but in the exponents it seems to imply an enormous jump in scale, especially if they're base-16 exponents. (Or maybe I don't get what that actually means.) So far, I can't make sense of frequently wobbling between such different scales.

    BTW, this shows up clearly in a hex dump of the data using the GNU/Linux "hd" command with no optional arguments. By default, it shows rows of 16 bytes in hex, then an ASCII interpretation of those bytes on the right. Values around 42 print as ASCII capital letters, and values around C2 are nonprinting, so they just get a dot.

    The raw data I've been plotting look like this:

    Code:
    00000000  c1 c8 c0 00 41 67 40 00  c2 12 f0 00 c2 21 c8 00  |....Ag@......!..|
    00000010  c2 24 54 00 c2 22 0c 00  c2 1f f0 00 c2 2c 44 00  |.$T..".......,D.|
    00000020  c2 47 fc 00 c2 4d f8 00  c2 63 58 00 c2 69 5c 00  |.G...M...cX..i\.|
    00000030  c2 2d c4 00 42 66 98 00  42 b0 dc 00 42 37 e4 00  |.-..Bf..B...B7..|
    00000040  c2 dc e8 00 c3 17 d8 00  c3 12 42 00 c1 2d 00 00  |..........B..-..|
    00000050  43 11 e8 00 43 12 9a 00  42 11 20 00 c2 ec 68 00  |C...C...B. ...h.|
    00000060  c2 e0 d4 00 c2 37 10 00  42 5e f0 00 42 ce 34 00  |.....7..B^..B.4.|
    00000070  42 99 2c 00 c2 3f 9c 00  c3 14 f8 00 c3 1a 00 00  |B.,..?..........|
    00000080  c2 b3 d4 00 42 ab 60 00  43 15 2a 00 42 c2 78 00  |....B.`.C.*.B.x.|
    00000090  c2 2c b8 00 c2 a7 84 00  c2 7b 34 00 42 38 2c 00  |.,.......{4.B8,.|
    000000a0  42 ec 1c 00 43 13 21 00  42 df 54 00 42 43 0c 00  |B...C.!.B.T.BC..|
    000000b0  c2 1c 7c 00 42 1f 3c 00  42 c5 10 00 43 11 6b 00  |..|.B.<.B...C.k.|
    000000c0  43 11 45 00 42 a9 90 00  42 30 a8 00 c2 48 b0 00  |C.E.B...B0...H..|
    000000d0  c2 97 2c 00 c2 94 b0 00  c2 71 d0 00 c2 5d bc 00  |..,......q...]..|
    000000e0  c2 49 5c 00 c2 35 20 00  c2 1d d0 00 c1 ce c0 00  |.I\..5 .........|
    000000f0  c2 18 f8 00 c2 2a e0 00  c2 32 78 00 c2 2e 7c 00  |.....*...2x...|.|
    You can see the columns of exponents, and the columns of zeros in the fourth bytes of fours.

    After deinterleaving 4x, it looks like this:

    Code:
    00000000  c1 41 c2 c2 c2 c2 c2 c2  c2 c2 c2 c2 c2 42 42 42  |.A...........BBB|
    00000010  c2 c3 c3 c1 43 43 42 c2  c2 c2 42 42 42 c2 c3 c3  |....CCB...BBB...|
    00000020  c2 42 43 42 c2 c2 c2 42  42 43 42 42 c2 42 42 43  |.BCB...BBCBB.BBC|
    00000030  43 42 42 c2 c2 c2 c2 c2  c2 c2 c2 c1 c2 c2 c2 c2  |CBB.............|
    00000040  c2 c2 c2 c2 c2 c1 41 42  42 41 42 42 42 42 42 42  |......ABBABBBBBB|
    00000050  42 41 c2 c2 c2 c2 c2 c2  c2 c2 c2 c2 c2 c2 c2 c2  |BA..............|
    00000060  c2 c2 c2 c2 c2 c1 42 42  42 42 42 42 42 42 42 42  |......BBBBBBBBBB|
    00000070  42 42 42 42 42 41 41 c2  c2 c2 c2 c2 c2 c2 c2 c2  |BBBBBAA.........|
    00000080  c2 c2 c1 c2 c2 c2 c2 c2  42 42 42 42 42 42 42 42  |........BBBBBBBB|
    00000090  42 42 42 41 41 41 42 42  42 42 41 41 41 40 41 c2  |BBBAAABBBBAAA@A.|
    000000a0  c2 c2 c2 c2 c2 c2 c2 c2  c2 41 42 42 42 42 42 42  |.........ABBBBBB|
    000000b0  42 c2 c2 c2 c2 c2 c2 c2  c2 c2 c2 41 42 42 42 42  |B..........ABBBB|
    000000c0  42 42 42 42 42 42 42 42  c1 c2 c2 c2 c2 c2 c2 c2  |BBBBBBBB........|
    000000d0  c2 c2 c2 c2 c2 c2 c2 c2  c2 c2 c2 c2 42 42 42 42  |............BBBB|
    000000e0  42 42 42 42 42 42 42 41  41 c2 41 41 c2 c2 c2 c1  |BBBBBBBAA.AA....|
    000000f0  c2 c2 c2 c2 c2 c2 c2 c2  c2 c2 c2 c2 c2 41 42 42  |.............ABB|
    The alternating runs of dots and ASCII printing characters on the right make the alternation between C2 (plus or minus 2) and 42 (plus or minus 2) clear.

    (With a simple alphabet-mapping filter you can make that work better, making the different ranges of numbers you care about map to very different-looking printing characters, e.g., mapping numerically small numbers to visually small printing characters, like "." and ":", and large values to visually big ones like B, as in old "ASCII art" photo renderings. Unfortunately, you lose the ability to see the original raw data side-by-side with the ASCII interpretation.)
    Last edited by Paul W.; 7th May 2014 at 15:13. Reason: typo

  #7: Matt Mahoney (Expert)
    I wrote a program to draw the plot. The downward slope is an artifact of the program. The vertical position represents the position in the file plus the data value. The horizontal position is from the last header.

    The high bit of the first byte is the sign bit of the mantissa. The rest of the byte is the exponent as a power of 16 with 40 or c0 being an exponent of 16^0 = 1. The mantissa is MSB first and does not drop any leading bits and is always non-negative.


  #8: Paul W. (Member)
    Thanks. That makes sense of a lot of things... C0, C1, and C2 are just -0, -1, and -2, and 40, 41, and 42 are just 0, 1, and 2. It's a small contiguous range of common exponents, split by the sign bit.

    I should have seen that; I should have expected something like that.
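
    So, under that description, decoding a sample would go something like the sketch below. The mantissa scaling is my assumption (treating the 24 bits as a fraction, as in IBM-style hex floats); I haven't verified it against the actual values.

    Code:
    // Sketch of decoding a 4-byte geo sample per Matt's description.  The
    // mantissa scaling (24-bit fraction, as in IBM hex floats) is my own
    // assumption and hasn't been checked against the real values.
    #include <stdio.h>
    #include <math.h>
    
    double decode(const unsigned char b[4]) {
      int sign = (b[0] & 0x80) ? -1 : 1;       // high bit: sign of the mantissa
      int e    = (b[0] & 0x7f) - 0x40;         // 0x40 (or 0xc0) means 16^0
      long m   = ((long)b[1]<<16) | ((long)b[2]<<8) | b[3];   // mantissa, MSB first
      return sign * (m/16777216.0) * pow(16.0, e);
    }
    
    int main() {
      FILE* in = fopen("geo", "rb");
      if (!in) return 1;
      unsigned char b[4];
      for (int i=0; i<8 && fread(b, 1, 4, in)==4; ++i)   // first few samples, cf. the hex dump above
        printf("%02x %02x %02x %02x -> %g\n", b[0], b[1], b[2], b[3], decode(b));
      return 0;
    }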

    Given the runs of exponent values, and the fact that it's a continuous function, that suggests that just treating it as an array of 32-bit integers and using linear prediction would get you some benefit within those runs, but screw up on the low bits.

    Even something like WKdm probably shouldn't do too horribly badly, just separating out the low bits and (very crudely) recency coding all the high bits. (That often works passably on floats representing continuous functions, and doesn't know anything about exponents and bases anyhow.)

    Maybe I should make a parameterized filter like that for visualization plots---give it a word size and a fraction, and it would show you how MTF-codable the high part is---e.g., the top 1/4, the top 1/2, the top 3/4, or the whole thing. Ideally I'd figure out a way to do that for multiple sizes and fractions in one left-to-right plot of the file, so you could see where things look like runs of 2-, 4-, or 8-byte integers, and roughly how many low bits they tend to vary in.
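
    As a first cut, something like the sketch below might do: take a word size and a count of high bytes to keep, MTF-code those high bytes, and report how often the MTF rank is tiny. (It reports one summary number rather than a left-to-right plot, and the parameters and thresholds are made up for illustration.)

    Code:
    // Sketch of the idea: for W-byte words, MTF-code the top H bytes of each
    // word and report how often the MTF rank is tiny (i.e. how recency-codable
    // the high part is).  Parameters and the rank threshold are illustrative.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main(int argc, char** argv) {
      int W = argc>1 ? atoi(argv[1]) : 4;      // word size in bytes
      int H = argc>2 ? atoi(argv[2]) : 1;      // leading (high) bytes to keep
      if (W<1 || W>64) W=4;
      if (H<1 || H>W) H=1;
      unsigned char mtf[256], word[64];
      for (int i=0; i<256; ++i) mtf[i]=i;
      long total=0, small=0;
      while (fread(word, 1, W, stdin)==(size_t)W) {
        for (int j=0; j<H; ++j) {              // assumes high bytes come first
          int c=word[j], k=0;
          while (mtf[k]!=c) ++k;               // current MTF rank of this value
          memmove(mtf+1, mtf, k);              // move it to the front
          mtf[0]=c;
          ++total;
          if (k<4) ++small;
        }
      }
      printf("%ld of %ld high bytes had MTF rank < 4 (%.1f%%)\n",
             small, total, total ? 100.0*small/total : 0.0);
      return 0;
    }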

  #9: Paul W. (Member)

    sao star catalogue from Silesia corpus

    I've done some basic examination of the sao star catalogue file from the Silesia corpus, with fv & my filters, finding the basic stride and some numerical fields.

    (I had no idea going in what a "star catalogue" would actually look like. So far, I haven't found the stage name or price fields, or lists of the movies they've been in.)

    If there's interest, I can post a bit about what I've found so far.

  #10: Matt Mahoney (Expert)
    Very funny. A quick Google search finds http://tdc-www.harvard.edu/catalogs/sao.html
    which describes a newer version. The difference is that silesia/sao lacks the 4 byte sequential identifier, so the record length is 28, not 32. The program below should help you visualize the data, although it doesn't display the 28 byte header correctly.

    Code:
    // Display silesia/sao on little-endian (x86) hardware. Public domain.
    
    #include <stdio.h>
    
    struct Star {
      double right_ascension;  // radians
      double declination;
      char spectral_type[2];
      short magnitude;  // x 100
      float ra_motion;  // radians per year
      float dec_motion;
    };
    
    int main() {
      FILE* in=fopen("sao", "rb");
      if (!in) return 1;
      Star s;
      while (fread(&s, 1, 28, in)==28) {
        printf("%14.12f %15.12f %c%c %5.2f %14.6g %14.6g\n",
            s.right_ascension, s.declination,
            s.spectral_type[0], s.spectral_type[1],
            s.magnitude*0.01,
            s.ra_motion, s.dec_motion);
      }
      return 0;
    }
    The output looks something like this:
    Code:
    ...
    0.217556454814  1.529096580937 G5  9.40   3.85427e-007  -2.42407e-008
    0.218371378131  1.456244113890 A2  5.50   2.54527e-006  -6.78739e-008
    0.218667284161  1.512264916519 F8  9.50  -1.34536e-006   2.90888e-008
    0.220182884450  1.486813022444 A5  8.20   1.01811e-006   6.78739e-008
    0.220232408167  1.529254824122 A3  9.50   5.30871e-007   3.87851e-008
    0.220705101507  1.554036172395 K5  9.50   2.34601e-005  -3.83003e-007
    0.221281205604  1.511015794069 G8  9.50   1.12719e-006   9.69627e-009
    0.221545550263  1.484077509729 G5  9.10  -9.38114e-007  -1.45444e-008
    0.222459957347  1.413902377755     9.10  -2.54527e-007   1.74533e-007
    0.224688015582  1.403066598057     9.00   1.19991e-006  -1.50292e-007
    0.228921529849  1.456263070105     9.10   1.09083e-006  -1.45444e-008
    0.229840372978  1.522208396637 G5  9.50   1.48353e-006  -7.27221e-008
    0.230730854506  1.438446167324 K0  8.90   5.96321e-007   1.98774e-007
    0.230781469055  1.471676848431     9.40   8.29031e-007    -1.309e-007
    ...
    The stars are grouped into 18 bands 10 degrees wide from 90 to -90 (in radians pi/2 to -pi/2). In each band, the stars are sorted by right ascension from 0 to 2*pi radians.

  #11: Paul W. (Member)

    a few pictures for sao

    OK, by the time I saw your mail I'd sort of half-figured out most of the fields for sao, which I thought was going OK, though I wasn't sure of any of them. Not bad, I was thinking, but a good time to look at the actual format.

    After looking at a plain fv picture and seeing the clear stride, I deinterleaved on a stride of 28 and quantized the bytes (just keeping the high 4 bits of each byte, as usual). I got this:

    [Image: saoq1_x28_sr4_fv.png]

    The wide noisy-looking feature that comes first covers five byte offsets; then the un-noisy black bar at distance 1 covers three more. And there's some interesting stuff above the first part of the black bar.

    That made me think I was looking at an 8-byte number field, LSB first, so that the five low-order bytes are too variable to look like anything but noise in this crude picture, the top two bytes are too repetitive to look like anything but a black bar, and the one in between is "just right" to show interestingly-patterned variability.

    The next 8 bytes might be another such number---a bunch of too-variable bytes, and then a few bytes that mostly make runs. At the far end, I was thinking that was likely two four-byte numbers, where the first 3 bytes are variable enough to look like noise but the fourth (MSB) isn't. If so, then what comes just before those two looks likely to be a two-byte number: a one-byte-wide noise column for the LSB, and a one-byte-wide very runny column for the MSB.

    So then I ran that through the delta coder, to see if anything numerically interesting happened:

    [Image: saoq1_x28_sr4_dlt_fv.png]

    The only particularly helpful thing that happened there is that for the first 8-byte number, the interestingly-patterned byte got a lot better---the interesting pattern moved way down, suggesting that the numbers were more linearly predictable than simply repetitive. So that probably is a number, and a number correlated with the file ordering, maybe a sort key.

    So my guesses turned out to be right, with two eight-byte numbers at the beginning, and a two-byte number and two four-byte numbers at the end. I hadn't made a guess yet about the two bytes in between---the one-character letter and digit fields---but if I'd looked at the usual "hd" default hex dump for the corresponding parts of the deinterleaved file, the ASCII interpretation part would have clearly shown one to be full of letters from early in the alphabet, and the other to be full of ASCII digits. I should have done that.

    Now I'm trying to figure out what filter/tool I need to make it easier to be sure whether the number fields you can see really are numbers, and to help guess whether they're integers or floats.

    Plus I want some kind of plot that shows how many alphanumeric character codes we're seeing at a given point in the file, vs. common Intel opcodes, vs. small binary numbers. If I could line that up directly under the fv plot, we'd see right off that when we get to the two-byte mystery field, we suddenly see a lot of printing character codes, no small binary values, etc. It'd be obvious, without having to look at a hex dump, that we'd found a text field of some sort.
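
    A crude first cut at that might just be a windowed byte classifier: count printable-ASCII bytes, zeros, other small values, and everything else per fixed-size window, and print one line per window to eyeball or plot alongside fv. A sketch (window size and classes made up for illustration; it doesn't try to recognize opcodes):

    Code:
    // Sketch of a per-window byte classifier: for each 1 KB window, count
    // printable-ASCII bytes, zero bytes, other small values, and everything
    // else.  The window size and classes are illustrative; the idea is one
    // line per window, to eyeball or plot under the corresponding fv picture.
    #include <stdio.h>
    
    int main() {
      unsigned char buf[1024];
      size_t n;
      long w=0;
      while ((n=fread(buf, 1, sizeof(buf), stdin))>0) {
        int print=0, zero=0, small=0, other=0;
        for (size_t i=0; i<n; ++i) {
          if (buf[i]==0) ++zero;
          else if (buf[i]<8) ++small;
          else if (buf[i]>=32 && buf[i]<127) ++print;
          else ++other;
        }
        printf("window %5ld: printable=%4d zero=%4d small=%4d other=%4d\n",
               w++, print, zero, small, other);
      }
      return 0;
    }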

