> PAQ holds records for compression ratio and its models aren't complicated.
> Just a mapping chain of context to probabilities.
> I don't say it's a simple piece of software altogether, though.
PAQ is actually the best argument for _my_ point.
It's not only complex, it's also hand-tuned throughout.
If you don't believe me, try getting the same quality out of it on
data with randomly permuted byte values -
it'd be simpler to just write a new one.
book1_r  = bitrev(book1)
wcc386_r = bitrev(wcc386)

                   book1   book1_r  wcc386  wcc386_r  sum_diff
ash04a /s255       202932  204086   260758  261426     -0.39%
ppmonstr vJ -o128  203247  205607   236758  241486     -1.61%
paq8o6 -7          192227  202909   194062  219119     -9.25%
Here we can see the level of tuning to "standard" data -
some "c>65" flags in ash's SSE context, much more of that
in ppmonstr's contexts, and lots of stuff in paq.
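The bitrev recoding used for the table above can be sketched like this (assuming it simply reverses the bit order within each byte - order-0 statistics survive, but every byte-value heuristic tuned to text, like those "c>65" flags, breaks):

```python
# Reversed-bit lookup table: e.g. 0x01 (00000001) -> 0x80 (10000000).
BITREV = [int(format(b, "08b")[::-1], 2) for b in range(256)]

def bitrev(data: bytes) -> bytes:
    """Recode a file by reversing the bit order of each byte."""
    return bytes(BITREV[b] for b in data)
```

Note that bitrev is its own inverse, so the transform is lossless and the comparison is fair.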
> In your answer you removed the most important thing - *But the difficult thing
> is to get the idea*
Because I wasn't talking about that.
> BTW: somewhere I read about language models which used symbolic AI stuff
> to distinguish word types and semantic relations. But I haven't seen any
> practical implementation of something like this.
There probably weren't any.
The problem is basically that there aren't many people able to write
a state-of-the-art CM compressor.
In that sense I'm confident only about Matt, D.Shkarin and myself.
And that doesn't mean we're especially smart or something - it's just that
most programmers still think that compression = LZ+huffman and don't have
enough experience with CM. And it's clearly impossible to gain anything
from semantic correlations with the likes of zlib. Even most people here
are working on LZ and have never tried to write a custom model for some specific
data type (preprocessors and paq modifications don't count).
> Can you clarify the difference between these SSE variants? I thought they
> basically do the same, you just use fewer bins and different representation of vertices
The main difference is the update, I guess. My update is similar to that mixer's,
just that the weight is fixed here and the SSE coefficients are updated instead.
But the idea is the same - to simulate an update of the mixed probability as if
it were a counter.
> (despite your implementation details like fp math).
That was 4 years ago. Now I use an integer implementation with 15-bit counters,
similar to my demo coders.
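The scheme described above can be sketched roughly like this - a generic SSE with 15-bit integer entries and counter-style vertex updates (names, the bin count, and rate=5 are illustrative assumptions, not the actual demo-coder code):

```python
PSCALE = 1 << 15   # probabilities as 15-bit fixed point
NBINS  = 32        # assumed number of SSE bins; table has NBINS+1 vertices

def sse_p(table, p):
    # map input probability p in [0, PSCALE) onto the bin grid
    # and interpolate between the two adjacent vertices
    i, frac = divmod(p * NBINS, PSCALE)
    a, b = table[i], table[i + 1]
    return a + (b - a) * frac // PSCALE, i, frac

def sse_update(table, i, bit, rate=5):
    # nudge both adjacent vertices toward the observed bit,
    # as if each vertex were a plain probability counter
    for j in (i, i + 1):
        if bit:
            table[j] += (PSCALE - table[j]) >> rate
        else:
            table[j] -= table[j] >> rate
```

The point of the counter-style update is visible here: the SSE vertices, not a mixing weight, absorb the prediction error.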
> Why shouldn't it be a measure of predictability?
Well... by definition of "measure"?
Probability is a measure of predictability, and so is likelihood, but a higher
rank doesn't _always_ mean better predictability, so I can't call it a "measure" -
it doesn't have a monotonic mapping to predictability.
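A toy illustration of the non-monotonicity (the probabilities are invented for the example):

```python
# Two contexts where some symbol sits at rank 0 in both, yet the actual
# predictability behind that rank differs wildly - the rank itself
# carries no fixed amount of information.
ctx_strong = {"e": 0.60, "t": 0.25, "a": 0.15}  # rank 0 is a confident guess
ctx_weak   = {"x": 0.05, "q": 0.04, "z": 0.03}  # rank 0 predicts almost nothing

def rank0_probability(dist):
    # probability of the top-ranked symbol in this context
    return max(dist.values())
```

Both contexts report "rank 0", but only the probability behind it tells you how predictable the symbol actually is.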
> highest rank usually is the lowest queue position, 0). I did some experiments and
> I can say that symbol ranking (MTF under some context!) outputs around 70%
> zero ranks (on "normal", compressible data,
That's normal, but doesn't really mean anything, as a file recoded into a
sequence of ranks would compress worse than the original.
> I think that combining up to (average symbols per model)*(number of models) for
> coding one symbol is more computation intensive than combining (number of
> models)*8 per symbol coding step.
Well, you're wrong about that, as it's actually faster than binary models for
redundant data. Of course we do have to compute the probabilities
of symbols with higher ranks (and that's done e.g. in ash), but the key is that
we only have to compute the probabilities up to the rank of the current symbol,
so the average amount is 1-2 such computations per symbol. And that's not
surprising if rank 0 is the right guess 70% of the time, so you don't have to
compute any further.
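That stopping rule can be sketched with an MTF-style queue (a generic sketch, not ash's actual structure): the coder evaluates probabilities rank by rank and stops at the rank of the actual symbol, so a ~70% rank-0 hit rate keeps the average near 1-2 evaluations per symbol.

```python
def code_symbol(queue, sym):
    """Return (rank, evaluations) for sym and do the move-to-front update."""
    evals = 0
    for rank, s in enumerate(queue):
        evals += 1                 # one probability evaluation per rank tried
        if s == sym:
            queue.insert(0, queue.pop(rank))  # move-to-front update
            return rank, evals
    queue.insert(0, sym)           # escape: new symbol goes to the front
    return len(queue) - 1, evals
```

With binary decomposition you'd pay a fixed 8 model lookups per byte instead; here the cost tracks the rank distribution.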
>> I wasn't talking about bugs (just recently reported another bug in IntelC btw,
>> so there're some in STL too probably). Guess you don't understand why recent
>> paq8 versions contain a tiny STL replacement.
> Here again, you snipped off the important part: Don't do things twice,
> if you don't need to.
I skip what I don't need.
> The only thing which i would interpret as "replacement" in PAQ
> is memory allocation which guarantees cache line alignment...
I meant the Array and String templates, and the fact that paq used to
#include STL headers but doesn't now.
> I'll answer the rest later, this took quite a lot of time.
Well, I'll keep you busy