1. ## Logistic mixing

Originally Posted by Shelwien
The mixer is a 3-way logistic paq mixer.
The minimization is done with respect to prediction error in the logistic domain.
Maybe it would be better to minimize with respect to coding i.e. -log_2(p) in logistic domain too.
There 's also an obvious improvement to all these gradient descent approaches, but I think it just would cost too much speed.

2. Wrong. It minimizes coding cost - just have a look at my recent post

3. Thanks, I guess I missed a constant in http://ctxmodel.net/files/mix_test/stems5.png
Not that it matters though

4. Originally Posted by toffer
Wrong. It minimizes coding cost - just have a look at my recent post
Ah ok, i really wondered why the "back-propagation" term is missing in the update rule.
Your derivation is very clear and leads to a nice simple update formulation, thanks for that .

I currently use something like that with 2-input mixing for p
w += r*d, where d=(p2-p1)/p when bit=1, or d=(p1-p2)/(1-p) when bit=0
which is derived from minimizing coding cost in linear domain.

did you try finding the minimum on the line f(w+r*d)=min! instead of using fixed step-size?

5. Ok, here's a comprehensive answer.

6. Yeah, I did a similar derivation when I introduced logistic mixing to PAQ7 but only commented the result. I noticed that adjusting the weights along the gradient of coding cost in weight space led to a simpler update rule than for back propagation, which minimizes root mean square prediction error. So if the actual output is p (a prediction between 0 and 1) and the desired output is y (the update bit 0 or 1), then back propagation says:

w[i] += L*x[i]*(y-p)*p*(1-p);

where L is the learning rate (around .001 to .01) and w[i] is the weight for input x[i]. Minimizing coding cost gives you

w[i] += L*x[i]*(y-p);

I thought it was neat that the extra terms went away. I had done the same analysis for linear mixing in PAQ6, where I took the partial derivative of coding cost with respect to weights. The equations were a little more complex but it did eliminate L as a parameter. For linear mixing you have to maintain w[i] > 0 but it's not required for logistic mixing.

7. The actual difference between cost minimization in logistic or linear domain is very small in my example but it all could be because I've done something wrong

I made different 2-input mixers. In my first version the linear mixing was actually better, but after literally copying the mixer from PAQ9a it went a little better.

This is the result on calgary corpus with literals coded by an O2-O1-O0-mix powered by an LZ-engine.
The mixer for Mix(p2,Mix(p0,p1)) is selected via actual coded byte and parts of last-byte.

difference could be because we lack parameter optimization, but funny how close they are (only have to change 1 line in the code)

Code:
3.141.622 bytes
linear mix:   845.585 bytes
logistic mix: 844.903 bytes
Originally Posted by Matt Mahoney
but it did eliminate L as a parameter
But L is the step-size on the gradient line, and the linear mix should have the same speed as the logistic variant.
Because we are able to replace the division in d=(p2-p1)/p when bit=1, or d=(p1-p2)/(1-p) when bit=0 with table-lookup multiplication.
and then we have w+=L*d which needs 2mul, 1add per bit, just like the logistic variant.

8. I describe linear and logistic mixing here: http://mattmahoney.net/dc/dce.html#Section_431

In a linear mixer, S0 and S1 (evidence for 0 and 1) start small and grow over time. This causes the weight updates to start large and get smaller over time. The reason you don't have a learning rate parameter is that you are mixing evidence (bit counts) to get probabilities. In a logistic mixer, the inputs are already probabilities. The rapid adaptation at the start comes from the models rather than from the mixer.

9. Ok, I understand linear mixing as p=p1*w1+...+pn*wn avoiding the squash and stretch.

But if your update formula for linear evidence mixing is wi := max[0, wi + (y - pi)(S n1i - S1 ni) / S0 S1],
(y - pi)(S n1i - S1 ni) / S0 S1 should be the negative gradient. Thus you're able to add a step-size on weight update.
It doesn't matter if you retrieve the formula from bit-counts, because you interpret them as a probability, but maybe I have to start Mathematica again

10. > Ok, I understand linear mixing as p=p1*w1+...+pn*wn avoiding the
> squash and stretch.

Its not really about avoiding anything.
In fact, both logistic and linear mixing are based on the same source model,
where each symbol is generated by a single submodel, and weights are probabilities
of submodel usage.
Just that you get linear mixing by summing symbol intervals
(ie there's a total probability of w1*p1 that symbol would be encoded
with model1 and w2*p2 for model2 - and we don't care which model it is,
so symbol probability is w1*p1+w2*p2)
and logistic mixing by summing likelihood intervals (string probabilities):

> But if your update formula for linear evidence mixing is wi := max[0, wi + (y - pi)(S n1i - S1 ni) / S0 S1],
> (y - pi)(S n1i - S1 ni) / S0 S1 should be the negative gradient.

Not sure what're you talking about. Anyway, the update in binary logistic mixer looks
like this:
Code:
  void Update( int y, int wr,int mw, int Limit, int p1,int p2 ) {
const int e = (((y<<MIX_SCALE_LOG)-pm)*wr)>>SCALElog;
w[0] += e*s[0];
w[1] += e*s[1];
}
and http://www.ctxmodel.net/files/mix_test/stems5.png (though there's some unnecessary stuff)

> because you interpret them as a probability, but maybe I have to start Mathematica again

11. Yeah wr is your update-rate, that possibility was all I wanted to point out.
There's only one question left, that's why you allow negative weights?
If i look at your derivation weights sum up to 1 but allowing negative weights (and using 2 for binary input) improves compression.

12. Well, the version I'm using (like in mix_test) looks like this:
Code:
struct iMixer {
word w;  // weights

void Init( int w0 ) {
w = w0 + hSCALE;
}

int rdiv( int x, int a, int d ) {
//    return x>=0 ? (x+a)>>d : -((-x+a)>>d);
return (x+a)>>d;
}

static int st( int p ) { return U.Stretch(p); }
static int sq( int p ) { return W.Squash(p);  }

int Mixup( int s0, int s1, int wr ) {
int x = s1 + rdiv((w-hSCALE)*(s0-s1),1<<(SCALElog-1),SCALElog);
x = rdiv( (x*wr), 1<<(SCALElog-1), SCALElog );
return x;
}

void Update( int y, int p0,int p1, int wq, int pm ) {
int py = SCALE - (y<<SCALElog);
int e = (py-pm);
int d = rdiv( e*(p0-p1), 1<<(SCALElog-1), SCALElog );
d = rdiv( d*wq, 1<<(SCALElog-1), SCALElog );
w += d;
}
};
So I don't use negative weights (even if it helps in some cases, it would be overtuning imho),
and instead of n-ary mixers I have trees of binary mixers.

13. if I translate correctly

mix:
x=s1+(s0-s1)(w-0.5)=(1-w+0.5)*s1+(w-0.5)*s0 (why the 0.5?)
x=x*wr (I tested that and it gained some bytes, but this seems rather shamanic)

update:
d=(1-bit-pm)*(p0-p1)*wq (is this derived from the "likelihood-mixer"?)
w+=d

14. 0.5 is for rounding. you don't need it with float-point arithmetics.
wr has an explanation - you can get it with p=p*(1-wr)+y*wr probability update and likelihood extrapolation of posterior probability.
for wq i don't know much, but it seems reasonable to control update rate anyway.
And actually I've got further improvements from tuning of stretch/squash mappings, which is even harder to explain.

15. Originally Posted by Sebastian
Ok, I understand linear mixing as p=p1*w1+...+pn*wn avoiding the squash and stretch.

But if your update formula for linear evidence mixing is wi := max[0, wi + (y - pi)(S n1i - S1 ni) / S0 S1],
(y - pi)(S n1i - S1 ni) / S0 S1 should be the negative gradient. Thus you're able to add a step-size on weight update.
It doesn't matter if you retrieve the formula from bit-counts, because you interpret them as a probability, but maybe I have to start Mathematica again
The trick here is that you can interpret bit counts as probability p = n1/n, n = n0+n1, but this implies giving greater weights to high counts (in addition to weighting by the mixer). For example, should n0=100, n1=100 (p = 1/2) have 100 times more weight than n0=1, n1=1?

To avoid this problem, n0*n1 is bounded so that both counts cannot get large. This has the effect that probabilities near 0 or 1 have large n and so get more weight. Logistic mixing also has this property because of the squash and stretch functions. The reason I think logistic mixing does better is because you can separate the modeling of bit counts from the mixer, which you can't do with linear evidence mixing. If you have a bit history like 0000001, then what is the probability of the next bit? Linear mixing uses a fixed model, like n0=4, n1=1 because of bit discounting rules that favor new data over old. This might not be the best rule because the data could be stationary or not. Logistic mixing lets the model decide by looking at the outcomes of the same sequence in other contexts to estimate the probability, then just mixes the probability.

16. Originally Posted by Shelwien
0.5 is for rounding
I don't get why this should help. You're shifting w to [0.5,1.5] and I tried that, but it did not change the output by one bit.

Also I don't understand where the term (p0-p1) in your update formula comes from.
Your mathematica derivation is clear to me.

17. 1. Consider the integer divisions

int(2/2) = int(1.0) = 1
int(3/2) = int(1.5) = 1
int(4/2) = int(2.0) = 2

and now with proper "integer"-rounding

int((2+1)/2)) = int (1.5) = 1
int((3+1)/2)) = int (2.0) = 2
int((4+1)/2)) = int (2.5) = 2

This simply improves accuracy and is an implementation issue. Don't mix it up with the derivation of formulas, etc.

2. p' = w p0 + (1-w) p1 + w p1 = p0 + (p1-p0) w

@Matt: I never understood why you call mixing "evidences" linear. It's neither linear in weights nor linear in counts or probability.

18. Originally Posted by toffer
int(2/2) = int(1.0) = 1
int(3/2) = int(1.5) = 1
int(4/2) = int(2.0) = 2

and now with proper "integer"-rounding

int((2+1)/2)) = int (1.5) = 1
int((3+1)/2)) = int (2.0) = 2
int((4+1)/2)) = int (2.5) = 2
this is done in the function rdiv

19. Originally Posted by toffer
@Matt: I never understood why you call mixing "evidences" linear. It's neither linear in weights nor linear in counts or probability.
So 2 models A and B have predictions Pa = Na1/Na, Pb = Nb1/Nb, where Na, Nb are total counts by models A and B, and Na1, Nb1 are counts of 1 bits. Then a linear mixer with weights Wa, Wb = 1-Wa has prediction p = (WaNaPa + WbNbPb)/(WaNa+WbNb). It's linear in weight*count, maybe not in weights or counts by themselves.

20. It's clearly linear if you abstract modeling from mixing, although I prefer the word linear regression.
The models probabilities are formed by occurence counters, but there could be many other solutions to this problem, like linear counters or what else.
I know the following is clear for everyone.

Let p=(1-w)*p1+w*p2 be the convex linear combination of two models.
Cost when retrieving a one-bit is -log(p).
A simple solution to this minimization-problem is gradient descent.
So taking partial derivative with respect to w gives p'=-(p2-p1)/p.
And a step for the parameter vector would be w:=w-L*p when retrieving a one-bit, where L is the step-size on the gradient line.

But for numerous reasons linear regression is bad for this problem, mainly because our dependent variables p1 and p2 are binary.
So logistic regression is a better answer.
Where p=1/(1+exp(-(s1*w1+s2*w2))). Taking the partial derivatives again, gives
w1:=w1+L*s1*(1-p) and w2:=w2+L*s2*(1-p) when retrieving a one bit, or simply w:=w-L*(s2-s1)*(1-p) if we use only one weight as in linear regression.
The only problem here is the interpretation of the weights.

But you can come to this solution from different initial situations.

So my mixer looks like this
Code:
#define MIXW2 // use two weights?

class Mixer2f {
const int wbits;
const int wscale;
int w1,w2,s1,s2,pm;
public:
Mixer2f():wbits(18),wscale(1<<wbits),w1(wscale>>1),w2(wscale>>1) { };
int idiv(const int v,const int s)
{
return (v+(1<<(s-1)))>>s;
}
int mix(const int _p1,const int _p2)
{
s1=domain.Stretch(_p1);
s2=domain.Stretch(_p2);
#ifdef MIXW2
pm=domain.Squash(idiv(s1*w1+s2*w2,wbits));
#else
pm=domain.Squash(s1+idiv(s1+(s2-s1)*w1,wbits));
#endif
pm=cGlobal::max(1,cGlobal::min(PSCALE-1,pm));
return pm;
}
void update(const int bit,float rate)
{
int e=(bit<<PBITS)-pm;
int r=wscale*rate;
#ifdef MIXW2
int d1=idiv(e*s1,PBITS);
int d2=idiv(e*s2,PBITS);
w1+=idiv(r*d1,PBITS);
w2+=idiv(r*d2,PBITS);
#else
int d1=idiv(e*(s1-s2),PBITS);
w1-=idiv(r*d1,PBITS);
#endif

w1=cGlobal::max(-wscale,cGlobal::min(wscale,w1));
w2=cGlobal::max(-wscale,cGlobal::min(wscale,w2));
}
};
results for the above mixer on calgary corpus with my previous mentioned LZ-codec (18-bit weights).
Code:
two weights unrestricted: 843.302
two weights from [-wscale,+wscale]): 843.187
two weights from [0,+wscale]): 843.230

one weight from [-wscale,+wscale]): 844.308 (mathematical incorrect)
one weight from [0,+wscale]: 844.254

21. @Matt
This function

$f(x)&space;=&space;m&space;x&space;+&space;n$

is linear in x. A mixing function f outputs a probability p. Your expression is

$f(w,&space;n_a,N_a,&space;n_b,N_b)&space;=&space;\frac{w&space;n_a&space;+&space;(1-w)&space;n_b}{w&space;N_a&space;+&space;(1-w)&space;N_b}$.

So, clearly, this function f is not linear in w, n_*, N_*, w or any product of them. The nominator and denominator are linear in these variables, i guess this is what you mean. However the mixing function is not.

But for numerous reasons linear regression is bad for this problem, mainly because our dependent variables p1 and p2 are binary.
Logistic regression typically is used when one wants to map "explainatory" variables - not in the range [0, 1] onto probabilities, since the sigmoid function ("squash") is asymptotically bounded by 0 and 1. Common examples are predicting the probability of a stroke as a function of age, weight, etc.

The only problem here is the interpretation of the weights.
Somehow i feel that people tend to ignore or don't understand this.

22. @Sebastian:
here's a simple coder with static linear mixing of order-0 and order-1 statistics.
http://nishi.dreamhosters.com/u/rc_v3.rar
Can you post your mixer results using it and the included sample file (book1bwt)?

23. @toffer, I see what you mean. It is not linear when you include the normalization in the denominator.

24. Originally Posted by Shelwien
Can you post your mixer results using it and the included sample file (book1bwt)?
I really don't get you here. My mixer with static mixing on book1bwt? If I use your mixer, results won't be too different from yours

What I did now was disabling matching in my LZ-codec and using the O2-O1-O0 Literal-Model (2 mixers) straight, without changing constants.
This gets 215.589 bytes with logistic mixing and two weights, and 223.168 using my linear mixer with one weight, though using the same update-rate (0.25) for both is a bad idea.

25. > I really don't get you here. My mixer with static mixing on book1bwt? If I use your mixer, results won't be too different from yours

I meant that you can plug in your adaptive mixer into it*, and see if it would improve the results over static mixing.
Also I have a lot of other results for that file with mix_test and o01 coders, so it would be easy to see how good is your mixer.
I don't think there's a better method of mixer quality estimation - when you compare completely different coders, results
can be affected with any other implementation details, not only mixers.
For example, in mix_test I had ~214xxx for book1 with logistic mix2(o0,o1), does that mean that your mixer is bad?

* using the same probability inputs: they're 15-bit probabilities of zero.

26. Originally Posted by Shelwien
For example, in mix_test I had ~214xxx for book1 with logistic mix2(o0,o1), does that mean that your mixer is bad?
If I add two more mixers for orders 0 and 1 I get about 213.xxx.

I have not played with BWT-data, and you're using a hell of constants. I have about 4.
I'll have to rewrite my squash/stretch functions to be compatible with different probability accuracies, then i'll post results. And an optimizer-backend wouldn't be too bad either.

Another thing which bothers me is the whole concept of linear mixing with 2 inputs p=(1-w)*p1+w*p2.
As mentioned above I suggested something as linesearch to find the optimal update-rate.
But that would be stupid in this case, because the optimal update rate would be either 0 (p1>p2) or 1 (p2>p1) if we retrieve a '1'-bit.
So gradient descent is just a compromise in this case to "average" the time-dependent functions p1 and p2.
There must be a better solution to this. You mentioned something as updating p like a counter and backpropagating the update to w. How does this works?

27. > > For example, in mix_test I had ~214xxx for book1 with logistic

> Dunno, your O2-mix (with linear mixer) does about 218.xxx, so I think it's not too bad.

http://nishi.dreamhosters.com/u/rc_v3.rar
Implements static linear mixing of order0 and order1 contexts, and its result is 216,774.
Which o2 are you talking about?

> If I add two more mixers for orders 0 and 1 I get about 213.xxx.

That's another matter. We have specific probability inputs here, defined by rc_v3 above.
And I'm suggesting to compare mixers by attaching them to that coder and looking at the results.
For now we have the default static mix and your mixer to compare, I'd post the results
with my logistic mixer later if necessary.

> I have not played with BWT-data, and you're using a hell of constants. I have about 4.

You're not supposed to tweak these anyway (except for M_mW0 = static mixer weight).
Also its best to not have any constants, but in reality all coders have them, just
in the implicit form. For example, p=squash(s) implies p=squash(1*s) if there's no
such multiplier already in the formula for s.
Thus, coders with more tuned parameters win
(eg. bsc uses the same linear counters with 4 update parameters).

> I'll have to rewrite my squash/stretch functions to be compatible
> with different probability accuracies, then i'll post results.

You don't really have to. Just shift the inputs to your desired precision,
we're not looking at the speed here anyway, and rounding likely won't matter either -
in fact it can improve compression there.

> And an optimizer-backend wouldn't be too bad either.

A demo of my optimizer is available on ctxmodel, but its perl-based, so you probably won't use it.
Then there's also m1, but its optimizer would be hard to attach afaik.
And then I remember osman talking about using matlab for optimization (I suppose you can
define an external function there, which would estimate your entropy for given arguments).

> Another thing which bothers me is the whole concept of linear mixing with 2 inputs p=(1-w)*p1+w*p2.

I think its fairly obvious.
You have your current probability-of-1 estimation p and new estimation y (=bit value).
Again, its a switching model - you can either use p to predict the next bit z (after y), or y.
Let's suppose that the probability of y producing a shorter code for z is w.
So, with probability (1-w) p might be better, and we have 4 cases:
z=0 using p, probability is (1-w)*(1-p)
z=1 using p, probability is (1-w)*p
z=0 using y, probability is w*(1-y)
z=1 using y, probability is w*y
But we only care about z value, so which prediction is used doesn't matter,
so we can merge the intervals.
z=0 -> (1-w)*(1-p)+w*(1-y)
z=1 -> (1-w)*p+w*y
And here we have linear mixing, where w is not just some parameter, but a probability
of y being better for coding.

> As mentioned above I suggested something as linesearch to find the optimal update-rate.

I'm not aware of any usable CM components with adaptive update parameters.
Each update affects all the subsequent estimations, so you can't optimize
the update parameters based only on the single next bit.
And its hard to optimize for bit strings here, because the function
becomes very complex.
Well, there's that approximated polynomial approach which we discuss sometimes
with toffer, but afaik its still theory.

> There must be a better solution to this. You mentioned something as
> updating p like a counter and backpropagating the update to w. How
> does this works?

Yes, my linear mixer and SSE components are based on that.
Suppose you have an update formula for a counter: p'=f(p,y).
And then you have a linear mix: p=(1-w)*p1+w*p2
The idea is that if p'=f(p,y) is a good update function, then
why not use it here - p1 and p2 may be constants for all we know.
So we need such a w' that the next mix would produce p':
(1-w')*p1+w*p2 = p'
w' = (p'-p1)/(p2-p1)
To that, I usually also add a |p2-p1|>threshold check, to avoid
errors due to limited integer precision (also its common sense
that with near equal p1 and p2 there's no reason to change the weight).

This is only applicable to linear components though, and the current
update formula in logistic mixer is not bad anyway - bruteforce update
doesn't gain much there (=collecting codelengths for all weight values
and choosing the next weight by best codelength).

28. Thanks for your suggestions, I'll post my results in the next days.

Originally Posted by Shelwien
> Another thing which bothers me is the whole concept of linear mixing with 2 inputs p=(1-w)*p1+w*p2.

I think its fairly obvious.
You have your current probability-of-1 estimation p and new estimation y (=bit value).
Again, its a switching model - you can either use p to predict the next bit z (after y), or y.
Let's suppose that the probability of y producing a shorter code for z is w.
So, with probability (1-w) p might be better, and we have 4 cases:
z=0 using p, probability is (1-w)*(1-p)
z=1 using p, probability is (1-w)*p
z=0 using y, probability is w*(1-y)
z=1 using y, probability is w*y
But we only care about z value, so which prediction is used doesn't matter,
so we can merge the intervals.
z=0 -> (1-w)*(1-p)+w*(1-y)
z=1 -> (1-w)*p+w*y
And here we have linear mixing, where w is not just some parameter, but a probability
of y being better for coding.
I think you misunderstood me here. With "whole concept" I don't mean the linear combination of two predictions. But maybe this is only a language problem by myself.

I don't understand what the switching between y and p has to do with the switching between p1 and p2?
I think about something like delaying the mixer update in some form, but maybe this is the same as your suggested "counter-update".

Ok simple and clear for a dumb ass like me
p=(1-w)*p1+w*p2 describes a switching source.
If you view w as a convex weight or a probability doesn't really matters to me.
The problem arises if we adjust w when we encounter the bit y. We do it in a way that the model that predicted y better, is favored in the next step. Simple as that.

But we discard some information in the update step where the models p1,p2 and the weight w get updated, because p1 and p2 get both adjusted in the direction of y too.

Example:
p1=0.6 p2=0.2 w=0.5
y=1: p1 was better, make w smaller, say w=0.3, update p1,p2 say p1=0.8, p2=0.9 (p2 is a fast counter)
y=1: p2 was better, make w larger, w=0.5

but it would have been better if we adjusted w beforehand in the direction of p2

see what I mean? Using the new p1 and p2 directly makes things really worse so I have to take a deeper look.

29. > I don't understand what the switching between y and p has to do with
> the switching between p1 and p2?

Its the same thing. We have two predictions and a probability of choosing one.
Btw, my counters mix 3 predictions in fact - p,y and a static one.

> I think about something like delaying the mixer update in some form,
> but maybe this is the same as your suggested "counter-update".

That method is only applicable to linear components.
It would be impractical for a logistic mixer, as it would involve exp/log.
And anyway, precision-wise, there's nothing better than bruteforce -
see mixer in http://nishi.dreamhosters.com/u/wav_00.rar

> If you view w as a convex weight or a probability doesn't really matters to me.

And I think that knowing what it is really helps.
Otherwise the idea of n-ary logistic mixer redundancy due to Sum w_i != 1
would never appear.
Also knowing what kind of a probability it is allows to implement some
alternate solutions - see "likelihood" update in http://www.ctxmodel.net/files/mix_test/mix_test_v4.rar
its based directly on estimation of the probability of one of inputs
producing shorter codes, which introduced a weird concept of different
contexts for mixing and mixer update, unthinkable otherwise.

> The problem arises if we adjust w when we encounter the bit y. We do
> it in a way that the model that predicted y better, is favored in
> the next step. Simple as that.

Sure, but details matter. They only don't in asymptotic results for stationary data.
While in reality choosing a wrong update rate for a good adaptive mixer can easily lead
to worse results than with static mixing.

> But we discard some information in the update step where the models
> p1,p2 and the weight w get updated, because p1 and p2 get both
> adjusted in the direction of y too.

Note that there're contexts usually, in which case the updated weight likely
won't be applied to previously updated inputs.
Although you're right, and doing a few iterations of mixer updates sometimes
helps too. But overall its much slower and not very stable, so isn't used
in actual coders.

30. If anyone is interested i can rip off the optimizer and make a simple example. It is a fairly general-purpose genetic optimizer, in fact i developed and tested it with the minimization of common testing functions, like Rosenbrock or Rastrigin.

Page 1 of 5 123 ... Last

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•