If we start with (ok, but remember to mention that you work with P(y=0); pm might be misleading, since it is not a linear probability but a logistic estimate):
pm = w1*Log[p1/(1-p1)] + w2*Log[p2/(1-p2)]
(pm, p1, p2 are probabilities of 0, y = the coded bit; the result is a code length, i.e. negative)
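Just to make that step concrete, here is a minimal sketch of the mixing in C++ (names and the restriction to two inputs are mine, not taken from the actual coder): stretch each input P(y=0) into logistic space, take the weighted sum, and note that the result pm is still in logistic space.

#include <cmath>

// Stretch: linear probability in (0,1) -> logistic space.
double stretch(double p) { return std::log(p / (1.0 - p)); }

// Squash: logistic space -> linear probability in (0,1); inverse of stretch.
double squash(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Two-input mix: pm = w1*stretch(p1) + w2*stretch(p2).
// p1, p2 are the models' P(y=0); pm is a logistic estimate, not a probability.
double mix(double p1, double p2, double w1, double w2) {
    return w1 * stretch(p1) + w2 * stretch(p2);
}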
it can be transformed into (I think by "conversion" you mean the reverse transform from logistic space (your pm) back to linear space, right? But you "converted" using exp(x), not Squash(x); having ln(1+exp(...)) instead of ln(exp(...)) makes the transformations you did impossible. That is not the same thing, and the result does not have the meaning of a linear probability. The p_i in your equations are probabilities, but the whole expression is in logistic space, so the conversion should be done with Squash.):
y=0: w1*Log[p1]   + w2*Log[p2]   - Log[p1^w1*p2^w2 + (1-p1)^w1*(1-p2)^w2]
y=1: w1*Log[1-p1] + w2*Log[1-p2] - Log[p1^w1*p2^w2 + (1-p1)^w1*(1-p2)^w2]
And then, if we convert that back to probabilities (exponentiating the two code lengths):

P(y=0) = p1^w1*p2^w2         / (p1^w1*p2^w2 + (1-p1)^w1*(1-p2)^w2)
P(y=1) = (1-p1)^w1*(1-p2)^w2 / (p1^w1*p2^w2 + (1-p1)^w1*(1-p2)^w2)

So this is what the actual mixing looks like: a weighted geometric mean of the input probabilities, renormalized so the two cases sum to one.
...If it were a probability, the rest would be correct, I guess.
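For what it's worth, a quick numerical check (just a sketch, with arbitrary example values) shows that Squash applied to the mixed logistic estimate gives exactly the renormalized geometric mean written above:

#include <cmath>
#include <cstdio>

int main() {
    // Arbitrary example inputs: two P(y=0) estimates and their weights.
    double p1 = 0.7, p2 = 0.4, w1 = 0.6, w2 = 0.9;

    // Mix in logistic space, then Squash back to a linear probability.
    double pm = w1 * std::log(p1 / (1 - p1)) + w2 * std::log(p2 / (1 - p2));
    double via_squash = 1.0 / (1.0 + std::exp(-pm));

    // Renormalized weighted geometric mean, i.e. the P(y=0) formula above.
    double a = std::pow(p1, w1) * std::pow(p2, w2);
    double b = std::pow(1 - p1, w1) * std::pow(1 - p2, w2);
    double via_geomean = a / (a + b);

    // Both lines should print the same value.
    std::printf("%.10f\n%.10f\n", via_squash, via_geomean);
    return 0;
}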
The "clipping" simply increases the dynamic range of Squash. Note that this can be useful, since the derivative (slope) near the asymptot at +-1 is rather small. In NN terms that is often done, too. The gradient descent gets slowed down by the low slope. So my suggestion would be, instead of optimizing the mappings themself (their stems) or the learning rate (as we discussed) one could simply optimize the range of Squash. In the implementation i initially gave you it was from -8...8. Thus you could save stems for squash from -10...10 and optimize the range -r ... r manually. Afterwards rescale the function to map to +-1 at the edges -r, r.
And meanwhile, the current result with a separately optimized mapping for pm = Squash(x) (to avoid clipping or something) is 213766 %).
So it was possible to improve it by almost 1k just by tweaking the mixer (compared to http://encode.ru/forum/showpost.php?p=7777&postcount=20 ).