# Thread: Mixing strategies

1. ## Mixing strategies

Concerning the topic of "mixing", several questions occurred to me.
(Apologies for boring you with them...)

As far as I understand, a typical mixing strategy is as follows:
Compute weights w_i, 1 <= i <= n, for each model m_i (with
estimation probability p_i). Then the mix p is computed as follows:
p = (w_1 * p_1 + ... + w_n * p_n) / (w_1 + ... + w_n).
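In code, the scheme I mean is simply this (a rough sketch; the function name is mine):

```python
def linear_mix(probs, weights):
    """Weighted average of the models' probability estimates."""
    total = sum(weights)
    return sum(w * p for w, p in zip(probs, weights)) / total

# Two models predict the next bit with 0.2 and 0.8; with equal weights the
# mix is the plain average, and with non-negative weights it always stays
# within [min(p_i), max(p_i)].
print(linear_mix([0.2, 0.8], [1.0, 1.0]))  # 0.5
```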

*** Question 1: Do negative weights make sense (and if so, in which cases)? I do not see any good "physical" interpretation of negative weights but I remember once having seen negative weights when it comes to mixing with the order -1 model. Maybe this is just a mathematical "trick" to increase/decrease probabilities beyond the restriction in question 2.

*** Questions 2+3: If the weights satisfy w_i >= 0, then we do have the restriction
min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
Is this a "severe" restriction in practice? Below, I try to explain what I mean by "severe".
Does neural network mixing (e.g. as in paq) also have this restriction (since it uses non-linear mixing)?

*** Question 4: Is there some public state-of-the-art computation of the weights w_i? Which one do you use?

*** Question 5: Are there any other non-linear mixing strategies used (which and where?)?

Concerning question 2: for simplicity, suppose you are given two probability estimations p_1 and p_2 (and of course, you know the input so far and what the intermediate estimations of the two models were). To me it seems that the "best" possible mix p (derived from the two models) does not need to satisfy the restriction in question 2, but of course, I do not have a proof -- since it would depend on the models. Can you think of any cases in which the best mix obviously does not satisfy the restriction (i.e., in which the restriction is "severe"), or do you think that such cases cannot exist (i.e. the best possible mix has to lie between p_1 and p_2 if you have no further knowledge)?

Best,
Alibert 2. Originally Posted by Alibert Concerning the topic of "mixing" several questions occurred to me.
(Apologies for boring you with them...)
I like this topic. I hope the discussion becomes hotter. Originally Posted by Alibert As far as I understood, a typical mixing strategy is as follows:
Compute weights w_i, 1 <= i <= n, for each model m_i (with
estimation probability p_i). Then the mix p is computed as follows:
p = (w_1 * p_1 + ... + w_n * p_n) / (w_1 + ... + w_n).
Yes. This is the well-known linear variant of mixing. AFAIK, it was used prior to PAQ7. The main disadvantage is the division (=slow). Originally Posted by Alibert *** Question 1: Do negative weights make sense (and if so, in which cases)? I do not see any good "physical" interpretation of negative weights but I remember once having seen negative weights when it comes to mixing with the order -1 model. Maybe this is just a mathematical "trick" to increase/decrease probabilities beyond the restriction in question 2.
I can't see any sense behind negative weights. If they are allowed in the above equation, then "p" can be negative too under certain circumstances, and that is out of range for a probability in the mathematical sense. Originally Posted by Alibert *** Questions 2+3: If the weights satisfy w_i >= 0, then we do have the restriction
min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
Is this a "severe" restriction in practice? Below, I try to explain what I mean by "severe".
Does neural network mixing (e.g. as in paq) also have this restriction (since it uses non-linear mixing)?
The final mixed prediction is usually better than the "best" model's prediction, so this restriction is not on the "must have" list. The same is true for PAQ. Originally Posted by Alibert *** Question 4: Is there some public state-of-the-art computation of the weights w_i? Which one do you use?
For me, an ideal weighting scheme should include the real coding cost implicitly or explicitly. For example, PAQ's neural network mixing uses the real coding cost (though it's still not ideal because of RC redundancy). PAQ's mixer is good enough to eliminate non-optimal models' predictions. So, if you feel your models don't output accurate predictions, then you can try the PAQ mixer. But, if you think your modeling strategies are near-optimal, you can use some weaker but faster methods. Also, an SSE-like approach can be used for mixing too. Please have a look here for a fast method: http://encode.ru/forum/showpost.php?p=2855&postcount=6 Originally Posted by Alibert *** Question 5: Are there any other non-linear mixing strategies used (which and where?)?
If you find any, please let me know. Originally Posted by Alibert Concerning question 2: for simplicity, suppose you are given two probability estimations p_1 and p_2 (and of course, you know the input so far and what the intermediate estimations of the two models were). To me it seems that the "best" possible mix p (derived from the two models) does not need to satisfy the restriction in question 2, but of course, I do not have a proof -- since it would depend on the models. Can you think of any cases in which the best mix obviously does not satisfy the restriction (i.e., in which the restriction is "severe") or do you think that such cases cannot exist (i.e. the best possible mix has to lie between p_1 and p_2 if you have no further knowledge)?
As I said before, we don't have any such restriction. I think you should think of mixing as a totally separate model which predicts the next symbol. Actually, this is true: "mixing" works on model predictions, while "models" work on real symbols or intermediate (=transformed) symbols. So, mixing is kind of a completely different model. But, again, there is an implicit correlation to the input symbols in any case (else it would be useless).

Best Regards
Osman 3. Originally Posted by osmanturan I like this topic. I hope the discussion becomes hotter. Me too! Thank you for the quick reply.

I can't see any sense behind negative weights. If this is allowed in above equation, then "p" can be negative too under some distinct circumstances. And this is an "out-of-range" for a probability in mathematical sense.
Of course, if an "out-of-range" value is met, then one has to execute something like p := MIN_PROBABILITY or p := MAX_PROBABILITY. Now, don't get me wrong: of course this won't be very practical; but still two questions remain: can it be useful (i.e. enhance the compression ratio)? And: is there some "physical" interpretation of negative weights?

Final mixed prediction usually better than "best" model's prediction. So, this restriction is not in "must have" list.
By using linear mixing with positive weights, this restriction comes automatically (even if you do not want to have it). Do you think that this might be the main disadvantage over PAQ's non-linear mixing?

... because of RC redundancy...
Sorry: what does "RC redundancy" stand for?

Also, SSE like approach can be used as mixing too.
Sorry again for my little knowledge: could you give some more details on this approach?

Please have a look at here for a fast method: http://encode.ru/forum/showpost.php?p=2855&postcount=6
Thank you for the link. If I understand correctly, this method is only capable of mixing two models at a time, i.e., mixing of more than two models is done by successively mixing two models. So, even if the mixing algorithm is nearly optimal when it comes to mixing two models, it is far from obvious that this is a good approach for mixing more than two models. Do you (or anybody else) have any more insight into mixing more than two models (and why the successive mixing of two models still performs quite well)?

If you find any, please let me know. I am working on it! Best regards,
Alibert 4. ## Hi!

I believe that all these very interesting mathematical rules only complicate things, and that the better solution for compression is instead to be found in simplicity! 5. Originally Posted by Nania Francesco I believe that all these very interesting mathematical rules only complicate things, and that the better solution for compression is instead to be found in simplicity!
Maybe mathematics complicates things at first, but I think that at some point both insight and accuracy always pay off (for example, consider PPM vs. PPMII). 6. Originally Posted by Alibert Me too! Thank you for the quick reply.
You are welcome  Originally Posted by Alibert Of course, if an "out-of-range" is met, then one has to execute something like p := MIN_PROBABILITY or p := MAX_PROBABILITY. Now, don't get me wrong: of course this won't be very practical; but still two questions remain: can it be useful (i.e. enhance compression ratio)? And: is there some "physical" interpretation of negative weights.
Yes, clipping is another option, but I still can't see any benefit of negative weights in linear mixing as you described it (of course, I could be wrong). If there were one, a model would bias the other models through its negative weight. Strange behaviour... Originally Posted by Alibert By using linear mixing with positive weights, this restriction comes automatically (even if you do not want to have it). Do you think that this might be the main disadvantage over PAQ's non-linear mixing?
Yes, this restriction comes with it. But the main advantage of PAQ's mixing is using the near-real coding cost as a weighting scheme. For me, the main problem in designing a completely new mixer is finding a good and fast weighting method (the mixing stage relies on the weighting). Originally Posted by Alibert Sorry: what does "RC redundancy" stand for?
Sorry for the acronym. RC stands for range coder. Let me explain "RC redundancy": if we have a prediction P for the next bit, then in theory we need -log2(P) bits for that bit. But practical arithmetic coders and range coders have some redundancy (usually very small - e.g. on SFC, PAQ's RC has ~2000 bytes of redundancy). So a practical range coder tends to consume a bit more than the theoretical value, and this is generally known as "RC redundancy". Originally Posted by Alibert Sorry again for my little knowledge: could you give some more details on this approach?
In classical Markovian modeling, we try to predict symbols under some contexts by producing probabilities for them. Secondary Symbol Estimation, or SSE for short, takes this approach a bit further: it maps a context and probability pair to a new probability with a slowly learning counter. In an electronics sense, it's a kind of low-pass filter with an extra helpful input (here, another context). There is a good and short explanation here: http://encode.ru/forums/index.php?ac...page=5#msg9819
You can also have a look at the PAQ3 source code. Originally Posted by Alibert Thank you for the link. If I understand correctly, this method is only capable of mixing two models at a time, i.e., mixing of more than two models is done by successively mixing two models. So, even if the mixing algorithm is nearly optimal when it comes to mixing two models, it is far from obvious that this is a good approach for mixing more than two models. Do you (or anybody else) have any more insight in mixing more than two models (and why the successive mixing of two models still performs quite well)?
Yes, it's only for two models at a time. But it has its own advantages. For example, you can cut down the mixing tree at some points under some circumstances. I was trying to make a multi-level mixer which relies on the near-real coding cost. But, in the end, I had lots of very complex computations. Lookups could be used here. But then it would be equivalent to PAQ's mixer. Originally Posted by Alibert I am working on it! If you make some progress, please let me know.

Best Regards
Osman 7. Originally Posted by osmanturan In classical markovian modeling, we try to predict some symbols under some contexts by producing probabilities about them. Secondary Symbol Estimation or shortly SSE takes this approach a bit further. It maps a context and probability pair to a new probability with a slowly learning counter. In electronics sense, it's kind of low-pass filter with extra helpful input (in here it's another context). There is good and short explanation in here: http://encode.ru/forums/index.php?ac...page=5#msg9819
You can also have look into PAQ3 source code.
If I understand things correctly, using SSE for mixing works by using the probabilities of the different models (maybe with additional context) as inputs. I suppose the "practical" approach again is successively mixing two models. Has anybody tried this approach with more than two models? Also, I do not see how interpolation works with more than two models.

It seems that neural networks are the only (??) approach which is used in practice and which is capable of mixing more than two models at once. It seems that all other (heuristic?) linear approaches can be improved by using the binary linear mixing algorithm explained earlier. Am I right?

Best,
Alibert 8. Originally Posted by Alibert If I understand things correctly, using SSE for mixing works by using the probabilities of the different models (maybe with additional context) as inputs.
Actually, SSE needs one probability and one context to output one probability at a time. But in a mixing sense you can extend this idea a bit further: instead of using a single probability as input, you can design it as a 2D map (with the context, actually a 3D volume). You should keep it as small as possible, because a large table suffers from cache misses. So, you have to use some quantization. For example:
Code:
```q1 = Quantize(p1);                    // quantize both input probabilities
q2 = Quantize(p2);
activeModel = &SSE[context][q1][q2];  // select a counter by (context, q1, q2)
p = activeModel->P();                 // adjusted output probability
...                                   // code bit y with probability p
activeModel->Update(y);               // adapt the selected counter toward y```
Here, p1 and p2 are the input probabilities. "Quantize" is a quantization function (if you use a lookup table, you have to keep it as small as possible too). q1 and q2 are the quantized probabilities. "activeModel" is a pointer to a counter model, and SSE is an array of counter models. "p" is the adjusted probability and "y" is the just-coded bit. I think it's clear enough. Originally Posted by Alibert I suppose the "practical" approach again is successively mixing two models. Has anybody tried this approach with more than two models? Also, I do not see how interpolation works with more than two models.
Yep. Both Shelwien and I tried "successive mixing of two models", and probably toffer tried that idea too. PAQ9A's design also relies on this mixing scheme. You can use several combinations. For example:
p=MIX(p0, MIX(p1, MIX(p2, p3)));
or
p=MIX(MIX(p0, p1), MIX(p2, p3));
etc. Originally Posted by Alibert It seems that neural networks is the only (??) approach which is used in practice and which is capable of mixing more than two models at once. It seems that all other (heuristical?) linear approaches can be improved by using the binary linear mixing algorithm explained earlier. Am I right?
PAQ's mixer is widely known just because of PAQ's popularity. I'm sure there are other ways to do multi-input mixing. But, because of complexity, we again need lookup tables and, as I pointed out before, there is a high chance of ending up with an equivalent code path.

I think, first of all, you should decide one thing: near-ideal mixing (slow) or practical mixing (fast)?

If you want to go with practical mixing, you should keep in mind that common modeling techniques tend to produce probabilities near 0/1. So, you can discard some terms, as toffer pointed out. You can have a look at the proof in this post: http://encode.ru/forum/showpost.php?p=4282&postcount=65

For motivation, you can build a very simple coder which mixes two models, and then improve it. Maybe you will find another good mixing scheme in the end. Good luck - I look forward to hearing of any progress. 9. > *** Question 1: Do negative weights make sense (and if so,
> in which cases)? I do not see any good "physical"
> interpretation of negative weights but I remember once
> having seen negative weights when it comes to mixing with
> the order -1 model. Maybe this is just a mathematical
> "trick" to increase/decrease probabilities beyond the
> restriction in question 2.

An example where negative weights would be useful:
Suppose that we have an "inverted" binary model, which
gives a probability-of-0 estimation 0.0 when next bit=0
and 1.0 when next bit=1.
Then a perfect weighting scheme would be
p_1 = 0.5
p_2 = inverted estimation
p = ( 2*p_1 + (-1)*p_2 ) / ( 2 + (-1) ) = 1-p_2
And in fact, it's quite a practical case, as with
context sequences similar to 01010101 most common
estimators would be one step behind.
But imho it's better to handle such cases by specific
models, or at least by assigning separate weights to (1-p_i)
values, instead of allowing negative weights.
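The inverted-model example can be checked numerically (a quick sketch; the function name is mine):

```python
def linear_mix(probs, weights):
    """Weighted average; here one weight is deliberately negative."""
    return sum(w * p for w, p in zip(probs, weights)) / sum(weights)

# p_1 = 0.5 (uninformative), p_2 = inverted estimation; weights 2 and -1
# reproduce p = 1 - p_2 exactly, i.e. the negative weight "un-inverts" it.
p2 = 0.9
p = linear_mix([0.5, p2], [2.0, -1.0])
print(p)  # ≈ 0.1 (= 1 - p2)
```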

> min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
> Is this a "severe" restriction in practice?

This restriction might really cause redundancy in some cases.
But it's really easy to fix - by including p_0=0 and p_{n+1}=1.
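A quick numeric illustration of this fix (the values are arbitrary, my own choice):

```python
def linear_mix(probs, weights):
    return sum(w * p for w, p in zip(probs, weights)) / sum(weights)

# Both real models say 0.8, so an ordinary non-negative mix could never
# leave 0.8. Adding virtual endpoint "models" p=0 and p=1 lets the mix
# escape the [min, max] range while all weights stay non-negative:
p = linear_mix([0.0, 0.8, 0.8, 1.0], [0.0, 1.0, 1.0, 2.0])
print(p)  # ≈ 0.9, above max(p_1, p_2)
```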

> *** Question 4: Is there some public state-of-the-art computation of the weights w_i?
> Which one do you use?

Aside from straightforward ad hoc models like ppmy and ash (which directly
use the weighting scheme you described) and the paq versions, it's hard to name anything,
as it's a "slow" approach.
Also, paq's weight update function won't be particularly useful in linear space.
So it's kinda hard to tell what's "state-of-the-art" here.
But in fact it's possible to derive a weight distribution from the probability distributions
given by any practical PPM or CM model (and even LZ, if you really try).

Anyway, there's certainly no popular data type produced by a generator based on
a linear vector mixing model, so there's not much reason to look for a "perfect"
weight estimation method for such a scheme.
I can give you a possible candidate, though - after processing each symbol you
can perform a weight optimization for minimal compressed size of the already
processed data (using GA or any other optimization algorithm).

> *** Question 5: Are there any other non-linear mixing strategies used (which and where?)?

Not really, as there's not much "physical sense" in that. For example, you can't
explain how even paq's model could find a perfect asymptotic approximation for a
simple source model like:
b1 = (random(0..1)>=p1);
b2 = (random(0..1)>=p2);
output( random(0..1)>=p3 ? b1 : b2 );
though with linear weighting it would be possible:
p = (1-p3)*p1 + p3*p2
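This can be checked with a quick simulation of the three-line switching source above, reading each p_i as a probability-of-0 (a sketch; the parameter values are arbitrary):

```python
import random

random.seed(1)
p1, p2, p3 = 0.2, 0.7, 0.4
n, zeros = 200_000, 0
for _ in range(n):
    b1 = random.random() >= p1               # source 1: P(bit=0) = p1
    b2 = random.random() >= p2               # source 2: P(bit=0) = p2
    b = b1 if random.random() >= p3 else b2  # pick source 2 with prob p3
    zeros += (not b)
# Empirical P(bit=0) vs. the linear mix (1-p3)*p1 + p3*p2:
print(zeros / n, (1 - p3) * p1 + p3 * p2)
```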

But, afair, CTW's mixing is non-linear as well.
Also any PPM (due to symbol masking), and any coder with secondary estimation. 10. Originally Posted by Shelwien > *** Question 1: Do negative weights make sense?

An example where negative weights would be useful...
Thank you for your simple yet enlightening example. As far as I understand, using negative weights can eliminate the "bad influence" of some models on others; of course, the main problem is identifying such correlations (both qualitatively and quantitatively).

But imho it's better to handle such cases by specific
models, or at least by assigning separate weights to (1-p_i)
values, instead of allowing negative weights.
Could it be that separate weights for p_i and (1-p_i) yield inconsistency (i.e. weighted probabilities do not sum up to 1)?

> min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
> Is this a "severe" restriction in practice?
This restriction might really cause redundancy in some cases.
But it's really easy to fix - by including p_0=0 and p_{n+1}=1.
I suppose one has to use fixed weights w_0 and w_{n+1}, since adaptive weights in most settings will tend to 0. On the other hand, in linear mixing this will simply yield the addition of a number which only depends on the other weights. This looks strange... (maybe I am just a little bit confused).

I am still unsure whether PAQ's neural network mixing does have the same restriction (which of course could be solved using your p_0 / p_n+1 approach). To me, it appears as if SSE is the only approach (which is currently applied) to get rid of the above restriction.

<mixing> is a "slow" approach.
Do you also consider toffer's 0/1 modification as a slow approach? (This is nothing personal; to me toffer's approach seems very efficient and I am curious about your opinion on what to consider "slow".) Maybe the "slowness" comes from the multiple models (which in a lot of cases are doing rather similar things; e.g. Markov models of different orders, use of determinism in sources by applying KT-estimators and match predictors,...).

Anyway, there's certainly no popular data type produced by a generator based on
a linear vector mixing model, so there's not much reason in finding a "perfect"
weight estimation method for such a scheme.
Sorry, what do you mean by "generator"? I am afraid, I cannot follow your statement. Could you please explain it a little further?

I can give you a possible candidate, though - after processing each symbol you
can perform a weight optimization for minimal compressed size of already processed
data (using GA or any other optimization algorithm).
This does not seem to be very "practical" but on the other hand, it surely would be interesting for comparisons. Thank you for sharing!

Best,
Alibert 11. > Could it be that separate weights for p_i and (1-p_i)
> yield inconsistency (i.e. weighted probabilities do not
> sum up to 1)?

I mean like (w1*p1+w2*(1-p1))/(w1+w2)... so no inconsistency.
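In code (a sketch; naming is mine):

```python
def mix_with_complement(p1, w1, w2):
    # Non-negative weights on p1 and its complement (1 - p1); with w1 = 0
    # this reproduces the "inverted model" fix without any negative weight,
    # and the result is still a valid probability in [0, 1].
    return (w1 * p1 + w2 * (1 - p1)) / (w1 + w2)

print(mix_with_complement(0.9, 0.0, 1.0))  # ≈ 0.1
```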

> I suppose one has to use fixed weights w_0 and w_n+1,
> since adaptive weights in most settings will tend to 0.

I don't think so. In any case, if they were really adaptive,
then they would also adapt specifically to the cases where
the optimal p is out of the min/max bounds.

> Do you also consider toffer's 0/1 modification as a slow
> approach?

There's PPM, which attempts to encode most of the
symbols using the probability directly taken from the statistics
of a single context.
Compared to PPM, mixing is always slower, even if it's
static mixing (that is, a speed-optimized PPM would be
faster than a CM with the same compression level, but
the reverse is not true).

> (This is nothing personal; to me toffer's approach seems

Actually, there were people who used such things
some 10 years ago (Serge Osnach, for example). Also, here's an original post on that kind of mixer:
http://ctxmodel.net/rem.pl?-12
(It should be somewhere in the old forum too).

> very efficient and I am curious about your opinion on what
> to consider "slow".)

I don't really care about slowness as long as it's computable.
In fact, I don't care about practical compression either.
For me, compression is just an objective measure of
a model's performance.

> Maybe the "slowness" comes from the multiple models

Well, that too (a CM commonly updates all the submodels,
while PPM doesn't need to update or even access them).
But also adaptive mixing involves multiplication, and
that's not exactly fast.

Just for reference, it seems that people only like
compressors with >2Mb/s speed (for practical use).

> Sorry, what do you mean by "generator"? I am afraid, I
> cannot follow your statement. Could you please explain it
> a little further?

Ah, well.
There's a common "magical" approach to compression, where
people try to invent a simple universal formula which would solve
all the problems.
But in fact it's rather different.
For each data sample, we have a source which generates it by
a certain algorithm (http://cs.fit.edu/~mmahoney/compression/uiq/ is related).
So an ideal compressor should find the best-fitting parameters for that algorithm
from the given data.
But commonly the source algorithm is either too complex (as with executables
compiled by gcc) or unknown ("natural" text).
So we can only build very approximate models, based on visible dependencies in
the data. And what do we do when there are different kinds of dependencies in
different parts of the data? If we can't clearly differentiate the type of data,
then we can only assign probabilities to these dependencies; but to avoid redundancy
we have to integrate the symbol intervals from the different submodels, and that
becomes a linear mixing scheme. And if there are some conditions affecting
the submodels' probabilities, that becomes your weighting function.
But submodels are important here, not the "universal" mixing.
There's no reason to improve the mixing while trying to compress audio data
using text-oriented submodels.

> > I can give you a possible candidate, though - after processing each symbol you
> > can perform a weight optimization for minimal compressed size of already processed
> > data (using GA or any other optimization algorithm).
>
> This does not seem to be very "practical" but on the other
> hand, it surely would be interesting for comparisons.

Well, it is practical enough. Once I spent some time developing a
series of mixer implementations (mostly) based on something called BFA
(Bruteforce Adaptation). The initial idea was that we can restrict
the weight value set and accumulate the code length for all possible
values, but there turned out to be a lot of ways to optimize it. 12. The following seems to be a new mixing strategy (beta weighting):

http://www.informatik.uni-stuttgart....kbf09wic30.pdf

Does anybody know how it compares with

http://encode.ru/forum/showpost.php?p=2855&postcount=6

(speed/compression ratio)?

Cheers,
Alibert 13. I'm very glad to see that Edgar Binder is still there and works on data compression,
so thanks for posting the link.

But as to the mixer itself, it seems to be pretty much Bayesian, so nothing new.
Of course, that doesn't mean the proposed scheme is bad or doesn't work.
But still there are some flaws:
1. It's slow and there are a lot of precision-related troubles.
2. It's best for generators based on static order0-controlled
switching (not mixing) of stationary sources, and such "natural"
generators basically don't exist.
3. It's not so visible in the form of bpc, but the best listed result
for book1 seems to be 2.249/8*768771=216121 (with 25 mixed
submodels), while http://ctxmodel.net/files/tree_v1.rar produces
215719 with up to 17 submodels and static weighting.

Thus, I can't say that practical implementations of this method
are impossible, but they might appear surprisingly similar to some other
methods. 14. Interesting paper. But the calculation of p in section 3.1 immediately shows why the results are poor. It is the wrong way to mix probabilities no matter what weights you use. You can't just average them. You need to also consider the confidence of each model.

Suppose you have two models that predict that the next bit will be a 1 with probabilities 0.5 and 0.98. The first model doesn't know, but the second is highly confident that the next bit will be 1. Therefore the second model should get greater weight. Linear mixing would predict 0.74, but a better prediction would probably be 0.9. Depending on weights to handle this won't work, because confidences depend on context which can change rapidly from one bit to the next.

Both paq6 and paq8 use confidences in addition to adjusting model weights. In paq6, each model outputs a count of zeros and ones which are weighted and summed. A prediction near 0 or 1 requires one of the two counts to be large, which means that the sum of both counts will also be larger than if the prediction is near 0.5. paq8 models output probabilities, but they are first stretched (p -> log(p/(1-p))) before mixing and squashed (the inverse of stretch) afterward. Stretching gives large absolute values when p is near 0 or 1. A prediction of p = 0.5 has weight 0.
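The stretch/squash point in miniature (a sketch; the 0.5/0.98 pair is the example above, and the equal weights are an arbitrary choice of mine):

```python
import math

def stretch(p):
    return math.log(p / (1 - p))

def squash(x):
    return 1 / (1 + math.exp(-x))

# Equal-weight logistic mix of an uninformative model (0.5) and a
# confident one (0.98): stretch(0.5) = 0, so the confident model
# dominates, unlike the linear average 0.74.
p = squash(0.5 * stretch(0.5) + 0.5 * stretch(0.98))
print(p)  # ≈ 0.875
```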

It is not clear from the paper just what compression algorithm they are using, but they mention that using a separate weight for each model state improves compression but they don't know why. Well, I will tell you why. Because probability and confidence both depend on state. If a prediction is near 0 or 1, it will usually be right, so those weights will grow. 15. http://ctxmodel.net/files/mix_test_v0.rar

Simple bytewise coding demo (o0/o1) for mixer comparison

Code:
```book1bwt
768771 - Uncompressed
323522 - Plain order0
282728 - Plain order1
273999 - Static order0+order1 mix (optimized)
271401 - Adaptive order0+order1 mix (optimized)```
I'd like someone (preferably Matt) to attach the paq mixer to
this sample, to practically demonstrate how superior it is.

Statistics are frequency-based there, like in old compressors,
and are relatively unfit for BWT data.
But there's an improvement from using static and then adaptive
mixing, so a superior universal mixer should provide even better
results without changing the inputs. 16. By toffer's request here're some versions more compatible
with paq mixer... apparently it only works with binary
distributions or something.

http://ctxmodel.net/files/mix_test_v1.rar
Code:
```323043 - Plain order0
282051 - Plain order1
273806 - Static order0+order1 mix
271539 - Adaptive order0+order1 mix

v1
- Probabilities accessible at mixing stage
- Reoptimized to probabilities (slightly worse results)```
http://ctxmodel.net/files/mix_test_v2_bin.rar
Code:
```235937 - Plain order0
235584 - Plain order1
216816 - Static order0+order1 mix
216098 - Adaptive order0+order1 mix

v2
- Alternate bitwise model
- Reoptimized```
Still, I'd really prefer to get a mixer improvement for v0/v1 rather than
for v2 (as it's less trivial) - it's exactly the mixer's purpose to
get all the possible use out of any submodels, even bad ones.
But being able to evaluate v2 with paq mixer would be helpful too. 17. > Interesting paper. But the calculation of p in section 3.1
> immediately shows why the results are poor.

Yeah... Along with Bayesian statistics as a whole, I guess.

> It is the wrong way to mix probabilities no matter what
> weights you use. You can't just average them.

1. Static mixing is fast, and simple adaptive linear
mixing is fast enough too.
2. Linear mixing is the obvious solution for generators
with source-model switching: the mixer is the model of
the source-switching control stream.

I really wonder what your theory is about degenerate cases,
like all submodel inputs being static.
But to me it seems that the paq mixer's output would converge
really fast and wouldn't adapt to the data at all in such a case.

> You need to also consider the confidence of each model.

Sure, it's very helpful.
But first it's necessary to define that "confidence".
It might be relatively clear for high orders which
don't have much statistics, but not all models are
like that. And including some ad hoc value into
the probability estimation would usually hurt the compression.

> Suppose you have two models that predict that the next bit
> will be a 1 with probabilities 0.5 and 0.98. The first
> model doesn't know, but the second is highly confident
> that the next bit will be 1. Therefore the second model
> should get greater weight.

To me it sounds totally wrong.
That is, if we decided that based on the number of model context
occurrences (or some smarter things), it might be ok.
And it might also be reasonable for some specific compressor
implementations, where the probability is initialized to 0.5,
so that a ~0.5 estimation would likely mean an "uneducated"
model.
But that's totally implementation-specific, as it's obvious
that there might be completely random bits whose
locations are predictable with 100% confidence.

> Both paq6 and paq8 use confidences in addition to
> adjusting model weights. In paq6, each model outputs a
> count of zeros and ones which are weighted and summed.

Now, that's really useful, but fast compressors might
not be able to afford that.

> It is not clear from the paper just what compression
> algorithm they are using,

It's described in section 4 - DMC and a bitwise match model.

> Because probability and confidence both depend on state.
> If a prediction is near 0 or 1, it will usually be right,
> so those weights will grow.

Afaik it's better to decide that based on secondary statistics. 18. Here are some results for book1bwt with zpaq. Code:
```(results for zpaq c(this_file) archive book1bwt)
(note: all parameters are tuned)
comp 2 0 0 0 (hh hm ph pm)
(uncomment any of below to run)
(235069 simple order 0)   (1 0 cm 9 3)
(237916 simple order 1)   (2 0 const 0  1 cm 17 5)
(220811 static mixer)     (3 0 cm 9 3   1 cm 17 5     2 avg 0 1 120)
(219581 adaptive mixer)   (3 0 cm 9 3   1 cm 17 5     2 mix 0 0 2 14 0)

(221070 indirect order 0) (1 0 icm 4)
(231883 indirect order 1) (2 0 const 0  1 icm 12)
(215055 indirect chained) (2 0 icm 4    1 isse 12 0)
(214936 ind static mixer) (3 0 icm 4    1 icm 12      2 avg 0 1 160)
(214514 ind adapt mixer)  (3 0 icm 4    1 icm 12      2 mix 0 0 2 14 0)
(214772 chained & mixed)  (3 0 icm 4    1 isse 12 0   2 mix 0 0 2 14 0)
hcomp
d= 1 a<<= 9 *d=a halt   (set order 1 context for model 1)
post
0
end```
Note the const models don't do anything. I put them there so I didn't have to rewrite the hcomp code.

To keep it simple, contexts are direct lookups (or equivalently no hash collisions) and there is no mixer context. I tuned static mixer weights (avg), adaptive mixer learning rates (mix) and simple context model minimum adaption rate (cm) for max compression.

I could reduce sizes by 20 bytes by using b command instead of c (to omit SHA1 checksum).

As you might know, zpaq does all mixing (static or adaptive) in the logistic domain (stretch(p) = ln(p/(1-p))), with a final squash at the end (squash(x) = 1/(1+exp(-x))).
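As a concrete illustration, here is a minimal Python sketch of adaptive logistic-domain mixing in this style. It is not zpaq source; the class name, learning rate, and values are illustrative. The weight update follows the gradient of the coding cost.

```python
import math

def stretch(p):
    """Probability (0,1) -> logistic domain: ln(p/(1-p))."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Logistic domain -> probability (0,1): inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    """Mix n predictions in the logistic domain with adaptive weights."""
    def __init__(self, n, rate=0.002):
        self.w = [0.0] * n      # mixing weights
        self.rate = rate        # learning rate (illustrative value)
        self.x = [0.0] * n      # last stretched inputs
        self.p = 0.5            # last mixed prediction

    def mix(self, probs):
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, y):
        # gradient step on coding cost: w_j += rate * (y - p) * x_j
        err = y - self.p
        self.w = [w + self.rate * err * x for w, x in zip(self.w, self.x)]
```

Note that a model which keeps outputting 0.5 contributes x_j = 0, so its weight never moves, while a model whose stretched output correlates with the coded bits gets its weight pushed up.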

Direct models (cm) map a context to a prediction and use a variable adaption rate that starts fast and depends on a count. Indirect models (icm) map a context to a bit history and then to a prediction which is adapted at a fixed rate.

Chaining (isse) maps a context to a bit history which selects a mixer context for the previous model and a fixed constant.

http://cs.fit.edu/~mmahoney/compression/#zpaq 19. Well, thanks for the information.
But it's kinda unrelated to what I was talking about.
That is, it doesn't prove that the logistic mixer is better than the linear one or anything.

Fortunately, it seems that toffer did the thing I asked about instead.
I'll look at it, maybe tune the parameters, and post. 20. http://ctxmodel.net/files/mix_test_v3.rar

So here we have a proof that a paq-like mixer
is better than the one from "sh_mixer.inc".
But now the question is whether it's possible
to design a linear mixer equivalent to the logistic one,
with "confidence".

Code:
```215703 - dual fixed-rate linear mixer
214695 - with paq-like logistic mixer

- Added a logistic mixer provided by toffer
- Added a second instance of linear mixer (with same inputs)
for more fair comparison (so that both mixers would have
two state variables)

ToDo:
- Implement a linear mixer based on a counter
with bit frequencies
- Optimize the stems[] table for paq_mixer
- Clean paq_mixer and add more parameters (lim & mw in update)```
P.S. Already doing that stems[] optimization... 214339 atm,
kinda interesting. 21. Thanks for testing. I can't do linear mixing with zpaq without changing its basic architecture. But I did lots of experiments like this years ago on older versions of paq and also more recently with early versions of zpaq. As I wrote the spec I realized that making all predictions logistic was a more elegant solution (mixers and SSE are much simpler) and there was no reason to have linear predictions at all until it is time to arithmetic code.

Logistic mixing is equivalent to linear mixing if the models give similar outputs, like with SSE bypass mixing in paq8 (zpaq still uses logistic mixing for this). That looks like what the beta weighting did, since they were mixing DMC models with different parameters, rather than mixing different orders. The exception was the match model, but it seems possible to mix well because beta weighting adapts rapidly and the match model tends to stay on or off for long periods. I would be surprised if you could get good results with linear mixing of different orders (not BWT). 22. > I would be surprised if you could get good results with
> linear mixing of different orders (not BWT).

Are these results that bad: http://ctxmodel.net/files/MIX/mix_v6_SFC.htm ?
Or what about Ash? (http://ctxmodel.net/files/ASH/)
I only use linear mixing and can't say I'm dissatisfied with its performance.

Btw here's the new stems table (the result didn't get any better than 214339)
http://ctxmodel.net/files/mix_test/stems1.png (New values in red).
And here's the corresponding y=stretch(x) plot:
http://ctxmodel.net/files/mix_test/stems2.png
(new values in red, (new-old)*4 in green). 23. When talking with Eugene about the PAQ mixer i came accross the derivation of the logistic mixer again. I never read it anywhere so i decided to post it.

The PAQ mixer

Coding cost h of bit y:

h = -log2(P(y)) = -1/ln2 * ln(P(y))

Let p = P(y=1) be a probability estimation and a=1-y, b=2*y-1,
thus we can rewrite - dropping the constant:

-ln(a+b*p)

A logistic mixer (PAQ) computes the prediction mix p as:

p = Sq(x)
x = sum j w_j*x_j
x_j = St(p_j)

where x_j is the stretched input probability p_j. St(x) = ln(x/(1-x))
and Sq(x) = 1/(1+e^-x). Gradient descent can be used as an iterative
solver for the coding cost minimization:

w_j' = w_j + L*d_j
d_j = -dh/dw_j

Hence the gradient grad H = (d_0 d_1 ... ) is required:

dh/dw_j =
= - d/dw_j ln(a+b*p)
= - b/(a+b*p) * dp/dw_j
= -1/(a/b + p) * dp/dw_j

The partial derivative of p = Sq(x) (x and Sq(x) as above) is

d/dw_j Sq(x) =
= dSq/dx * dx/dw_j
= Sq(x)*(1-Sq(x))*x_j
= p*(1-p)*x_j

Thus dh/dw_j becomes

dh/dw_j =
= -1/(a/b + p) * p*(1-p)*x_j

Evaluating a/b for y={0,1} yields:

a/b + p, y=0 -> -1 + p = -(1-p)
a/b + p, y=1 -> 0 + p = p
a/b = -(1-y)

Again, reinsertion into dh/dw_j gives us

y=0: -1/(-1*(1-p)) * p*(1-p)*x_j = -(0-p)*x_j
y=1: -1/p * p*(1-p)*x_j = -(1-p)*x_j

Which can be simplified using e=y-p

dh/dw_j = -e*x_j = -(y-p)*x_j

Finally a weight update using gradient descent can be obtained via

w_j' = w_j + L*(y-p)*x_j 24. ...and there's something more we came across. Exponential decay is a strategy for modeling non-stationary processes, and can usually be expressed as:

v(k) = sum i=0..k-1 c^(k-1-i)*w(i) = w(k-1) + c*w(k-2) + c^2*w(k-3) ...

Iteratively it can be expressed as (0 <= c <= 1):

v(k) = c*v(k-1) + w(k-1)

Something interesting happens to a gradient weight update:

w(k) = w(k-1) + L*d(k)

Which doesn't look like exponential smoothing at all (otherwise it'd look like
w(k) = c*w(k-1) + L*d(k))

Now suppose the constant L is decomposed into (1-c)*M:

w(k) = w(k-1) + (1-c)*M*d(k)
= w(k-1) + (1-c)*M*d(k) - c*w(k-1) + c*w(k-1)
= c*w(k-1) + w(k-1) + M*d(k) - c*w(k-1) - c*M*d(k)
= c*w(k-1) + (1-c)*(w(k-1) + M*d(k))

Applying this decomposition we see that the w(k) update can be interpreted to look like:

w' = c*w_old + (1-c)*w_new

Which is called exponential smoothing and related to exponential decay:

w(k) =
= c*w(k-1) + (1-c)*x(k)
= c^2*w(k-2) + c*(1-c)*x(k-1) + (1-c)*x(k)
= c^3*w(k-3) + c^2*(1-c)*x(k-2) + c*(1-c)*x(k-1) + (1-c)*x(k)
...
= c^k*w(0) + (1-c)*[ c^(k-1)*x(1) + c^(k-2)*x(2) + ... + c*x(k-1) + x(k) ]

where x(k) = w(k-1) + M*d(k). Obviously w(k) is a linear combination of the initial weight w(0) aged k times and an exponentially weighted sum of the x(i).

The odd thing is that without applying such a mathematical trick (L = (1-c)*M, which is valid and doesn't change the weight update formula at all) one cannot see that exponential smoothing is involved here:

w(k) = w(0) + L*(d(1) + d(2) + ... d(k))

Any comments on that? 25. 'w' stands for weight.
'c' stands for context?
'v' stands for mixed output?
'L' stands for?
'd' stands for?
'M' stands for Model?

Yes I am a noob. 26. L, M, c are real scalar constants. I assumed L = (1-c)*M.

w(k), v(k), d(k) are some time series, k is the current step.

Just to show a general form of exponentially weighted estimation:

explicit formulation
(1) v(k) = sum i=0..k-1 c^(k-1-i)*w(i) = w(k-1) + c*w(k-2) + c^2*w(k-3) ...

implicit formulation
(2) v(k) = c*v(k-1) + w(k-1)

A gradient descent weight update (as with the logistic mixer):

(3) w(k) = w(k-1) + L*d(k)

here w(k) is a weight and L*d(k) a weight change.

w(k) = w(k-1) + L*d(k)
= w(k-1) + (1-c)*M*d(k) + ( 0 )
= w(k-1) + (1-c)*M*d(k) + (- c*w(k-1) + c*w(k-1))
= c*w(k-1) + w(k-1) + M*d(k) - c*w(k-1) - c*M*d(k)
= c*[w(k-1)] + (1-c)*[w(k-1) + M*d(k)]

(4) w(k) = c * w_old + (1-c) * w_new

w_new = w(k-1) + M*d(k)
w_old = w(k-1)

(4) was derived from (3) using a trick. (4) and (3) are equivalent, but (3) doesn't have a form like exponential smoothing (2), whereas (4) has a form like:

(5) w(k) = c*w(k-1) + (1-c)*x(k)

x(k) = w(k-1) + M*d(k)

(5) contains exponential smoothing like in (1):

w(k) = c*w(k-1) + (1-c)*x(k)
= c^2*w(k-2) + c*(1-c)*x(k-1) + (1-c)*x(k)
= c^3*w(k-3) + c^2*(1-c)*x(k-2) + c*(1-c)*x(k-1) + (1-c)*x(k)
...
= c^k*w(0) + (1-c)*[ c^(k-1)*x(1) + c^(k-2)*x(2) + ... + c*x(k-1) + x(k) ]
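As a quick numeric check of the smoothing equivalence, this Python sketch (arbitrary illustrative values for c, w(0) and the x(i)) iterates the implicit form and compares it against the unrolled exponentially decayed sum:

```python
# Verify that iterating w(k) = c*w(k-1) + (1-c)*x(k) equals the
# unrolled form c^k*w(0) + (1-c) * sum_{i=1..k} c^(k-i)*x(i).
c = 0.9
w0 = 0.5
xs = [0.2, 0.7, 0.1, 0.9, 0.4]   # arbitrary x(1)..x(k)

w = w0
for x in xs:                     # implicit (iterative) formulation
    w = c * w + (1.0 - c) * x

k = len(xs)                      # explicit (unrolled) formulation
w_explicit = c**k * w0 + (1.0 - c) * sum(
    c**(k - i) * x for i, x in enumerate(xs, start=1))

assert abs(w - w_explicit) < 1e-12
```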

EDIT: Problem solved - compared apples to oranges. A sum of weight changes d(k) equals an exponentially decayed sum of the actual weights w(k). 27. Thanks Toffer. I will study this carefully. 28. Toffer, I think that is the same derivation I used when I designed the new mixer for paq7. What I found interesting was that the weight update was simpler than back-propagation. For back-propagation, the goal is to minimize the mean squared output error rather than the coding cost. When you differentiate the squared error (y-p)^2 with respect to w_i, you get:

w_i := w_i + L*x_i*(y-p)*p*(1-p)

where w_i is the weight, L is the learning rate, x_i is the input, y is the desired output and p (0 < p < 1) is the actual output. When you differentiate the coding cost of bit y with probability p with respect to w_i, the factors p*(1-p) go away. I mentioned that in the paq7 comments but omitted the derivation.
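A small Python sketch (illustrative names and values, not paq7 source) contrasting the two gradients, and checking the coding-cost gradient -(y-p)*x_i by finite differences:

```python
import math

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def coding_cost(w, xs, y):
    """-ln P(y) for the mixed prediction p = squash(sum w_i*x_i)."""
    p = squash(sum(wi * xi for wi, xi in zip(w, xs)))
    return -math.log(p if y == 1 else 1.0 - p)

w = [0.3, -0.2]             # illustrative weights
xs = [1.5, -2.0]            # stretched model outputs
y = 1
p = squash(sum(wi * xi for wi, xi in zip(w, xs)))

# gradient of the coding cost: dh/dw_i = -(y - p)*x_i  (no p*(1-p) factor)
grad = [-(y - p) * xi for xi in xs]
# the squared-error gradient carries the extra p*(1-p) factor
# (times a constant 2, absorbed into the learning rate)
grad_mse = [gi * p * (1.0 - p) for gi in grad]

# finite-difference check of the coding-cost gradient
eps = 1e-6
for i in range(len(w)):
    wp = list(w)
    wp[i] += eps
    num = (coding_cost(wp, xs, y) - coding_cost(w, xs, y)) / eps
    assert abs(num - grad[i]) < 1e-4
```

The p*(1-p) factor shrinks the squared-error step exactly where a prediction is near 0 or 1, which is one way to see why the coding-cost update adapts faster there.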

Luckily the form of stretch concides with the model output histogram (count the number of times at which probs. the bits are coded) - higher resolutions near 0/1, where most predictions are made. And in this case the actual probability concides with some kind of "confidence" as well. But a high probability is not realated to some confidence at all.

There're some polts of model histograms:
http://encode.ru/forum/showthread.php?t=159

That would be completely different for a bytewise model. The NN mixer wouldn't work as well in that case. In my eyes that is caused by the alphabet decomposition. 30. Is there a mixing strategy when there is a variable amount of models involved?
I?ve tried to implement the Paq mixer into my compression algorithm, but I found out that the Paq mixer expects a constant amount of models for its input. #### Posting Permissions
