Once upon a time, cbloom talks about lzma
1. Tuned matrix as a generalization of match comparison heuristics
Surely its better to have a matrix than a bunch of hand-coded conditions like
With a matrix its replaced by
if( len==2 )
if( dist<256 )
encode a 2-byte match
encode a literal
Note that this is already more general than cbloom's version.
// price() quantizes the arguments into a single context
// and looks up a value in a table ("matrix") produced by parameter optimizer
cost1 = price(state1,flags1,len1,dist1);
cost2 = price(state2,flags2,len2,dist2);
if( cost1<cost2 )
But I also have a further idea - such heuristics are intended for
prediction of the best choice based on some inputs.
In other words, its the same statistical modelling task as compression,
so we can solve it with the same methods.
To be specific, we have a context (2 sets of match properties) and
a symbol (none/1/2 - we know the best choice after parsing optimization).
Thus we can tune a model which compresses the sequence of these symbols
(generated by processing a test corpus)
in those contexts, and get an optimized heuristic as a result.
2. "It's crazy that LZ compression can do so well with so much redundancy in the code stream"
> Start with the shortest compressed bit stream. See what raw bits that decodes to
I described a "CM with LZ model" before, which is probably a more practical version of that.
Enumerating the arithmetic codes and decoding is certainly an interesting way to enumerate
complex structures, but in this case looping through all matches available at current position
and integrating their codelengths by first referenced byte is clearly easier.
> add even *more* code stream redundancy by using things like the "repeat matches"
LZMA uses arithmetic coding and 11-bit probabilities, so it can encode constants
(fields that are the same in all records) with negligible redundancy.
Here's the test:
Note that IsRep flag is encoded for every match.
24557177 // enwik8.lzma
25021760 // enwik8.lzma processed with delrep to remove all rep codes
25002018 // Encoding of IsRep flag disabled
788.99 // rep support redundancy per MB of compressed data, bytes
197.42 // rep support redundancy per MB of uncompressed data, bytes
With that kind of overhead, we certainly can afford having lots of options
in the code, and a good parsing optimizer would improve the compression
without fail by making use of these options.