I'm looking for work (or just search terms) on hybrid (lossless) compression using multiple techniques or dictionaries based on segmenting a file into similar-looking hunks, and switching back and forth between different compressors for different apparent categories of data.
Here's a motivating example: I was looking at a test file of Bulat's that's basically just a bunch of DLL's concatenated together, and it's (unsurprisingly) clear from a hexdump that it's got batches of different kinds of data interleaved in it, at a scale of anywhere from dozens to thousands of bytes: runs of machine code, runs of 16-bit Unicode, runs of mostly 8-bit ASCII null-terminated strings, and runs of data containing integers or pointers aligned on some larger power-of-two boundaries.
That's unsurprising, because you'd expect each DLL to have several different segments: at least a segment of actual code, another segment of program literals (which might or might not include substantial text strings), and a link table consisting mostly of text strings (often in a different encoding). Depending on how those segments get combined, via mostly static or dynamic linkage, into larger library DLL's, similar segments from different object files may get concatenated together or left interleaved, and so on. (Some of these things are driven by compiler/linker segregation of stuff into pages to be mapped executable, read-only, or writable, and others are due to regularities in how people program and how compilers cash that out.)
Even within a single object file from a single source file, you often get similar segmentation, with machine code, read-only literals, initialization literals for writable data, debugging info tables, aligned struct literals separate from text literals, larger-aligned stuff preceding stuff with smaller alignment constraints, etc.
These general kinds of macro-scale regularities are very common in lots of other types of data, including in-memory data and "pickled" data structures (like Java "serialized" files). Memory allocators typically allocate similar objects in a given data structure more-or-less together in memory, one way or another. They're also common in various file formats for various applications, from word processors to 3D games, and often at several levels.
That makes me think that the first thing a compressor should do (conceptually) is segment the data into reasonable-sized blobs of dozens to hundreds of bytes and construct some stereotypes of the different kinds of chunks it sees, using different dictionaries or frequency tables or whatever, if not choosing different algorithms. Overly concretely, the categories could be "stuff that looks like code," "stuff that looks like structs," "stuff that looks like arrays of word-aligned ints or pointers," "stuff that looks like arrays of structs," and so on. Then it would try compressing the different categories as more-or-less separate streams with potential crosstalk. (E.g., when you see a few zeroes-followed-by-common-Latin-1-letters, you up the probabilities of common Unicode characters, but don't assume that's all you're likely to see: it might be a link table with alternating strings and pointers of some sort.)
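To make that concrete, here's a rough sketch of the kind of thing I have in mind (not anyone's published method, just my own illustration): chop the input into fixed-size chunks, classify each chunk from cheap byte statistics, and route the chunks into per-category streams that could each get their own model or dictionary. The chunk size, thresholds, and category names are all made up for illustration.

```python
from collections import defaultdict

CHUNK = 256  # hypothetical chunk size, in the "dozens to hundreds of bytes" range

def classify(chunk: bytes) -> str:
    """Crude byte-statistics classifier; thresholds are made-up placeholders."""
    n = len(chunk)
    zeros = chunk.count(0)
    printable = sum(1 for b in chunk if 0x20 <= b < 0x7F)
    # Zero bytes concentrated in every other position suggest 16-bit Unicode Latin text.
    odd_zeros = sum(1 for i in range(1, n, 2) if chunk[i] == 0)
    if printable / n > 0.85:
        return "ascii-ish"
    if odd_zeros / max(1, n // 2) > 0.7:
        return "utf16-ish"
    if zeros / n > 0.5:
        return "sparse/aligned-ints"
    return "code-ish"  # catch-all; real code detection would look at opcode patterns

def segment(data: bytes):
    """Split data into chunks, bucket them by category, and record a plan of
    (category, length) pairs so a decoder could interleave them back in order."""
    streams = defaultdict(bytearray)
    plan = []
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK]
        cat = classify(chunk)
        streams[cat] += chunk
        plan.append((cat, len(chunk)))
    return streams, plan
```

Each per-category stream could then be handed to a back-end suited to it (say, an E8/E9-style call-offset transform ahead of the code-ish stream, a text model for the ASCII-ish one), with the plan stored so the decoder can put the chunks back in their original order.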
This seems like an obvious idea that a zillion people must have had, and that probably has a name or a few associated keywords I don't know. Intuitively, I'd call it "context," where you have "contexts" like "ASCII-ish stuff," "code-ish stuff," "16-bit Unicode-ish stuff," and so on, but "context" in compression lingo almost always means Markov-ish literal strings of preceding atomic symbols, and this isn't that.
I would think that somebody's done a lot of work on how to bottom-up classify blobs of similar-looking data within files to enable this sort of compression, conceptually similar to blob detection and foreground/background detection in image processing...
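For what it's worth, here's the sort of bottom-up pass I'm imagining, again purely as my own illustration (the window size and threshold are arbitrary placeholders): compute a byte-value histogram over each small window and put a segment boundary wherever the distribution of adjacent windows changes sharply, loosely analogous to edge detection in images.

```python
def histogram(window: bytes):
    """Normalized byte-value histogram of one window."""
    h = [0] * 256
    for b in window:
        h[b] += 1
    total = max(1, len(window))
    return [c / total for c in h]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def find_boundaries(data: bytes, window: int = 128, threshold: float = 0.6):
    """Return offsets where adjacent windows look statistically different."""
    boundaries = []
    prev = None
    for off in range(0, len(data), window):
        cur = histogram(data[off:off + window])
        if prev is not None and l1_distance(prev, cur) > threshold:
            boundaries.append(off)
        prev = cur
    return boundaries
```

Presumably real work along these lines does something much smarter (adaptive boundary placement, clustering the resulting segments into categories like the ones above, etc.), which is part of what I'm asking about.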
Can anybody point me to literature on this, or just terms of art? I know that there are related things (like some zpaq models' ability to model stride contexts) but I don't know the technical names for them, or terms for generalized treatments of how to detect and exploit them.