
Thread: Logistic mixing & modeling

  1. #1
    toffer (Programmer)

    Logistic mixing & modeling

    After thinking a bit about PAQ's mixer - which uses gradient descent to compute its weights iteratively - I thought about applying the same idea to bit models. Instead of having a counter with a state p (a probability) in linear space, one could have a logistic counter s = St(p) and update its state in logistic space. When using logistic mixing, one then doesn't need to convert the input probabilities p into logistic space - thus no stretch function needs to be implemented. Altogether, a logistic counter with a gradient-based update could look like:

    h = -ln(a+b*p)
    p = Sq(s)
    dh/ds = -(y-p)
    (derivation and terms like in this thread http://encode.ru/forum/showthread.ph...ative#post7816)

    s' = s + L*(y-p) = s + L*(y-Sq(s))
    Sq(x) - Squash 1/(1+e^-x)

    That update can be derived like I did in the Mixing strategies thread - and it actually equals a fixed input ("one") with a "weight" s. Hence it is just a weighted bias neuron. BTW, the same update appears in the bias-neuron update when using logistic mixing - since it's the same thing. Nothing new, actually.

    Now I noticed that we could actually "reinvent" the wheel: when using logistic mixing along with such a model, why not take the mixing function's influence into account and adjust the bit models proportionally to their contribution to the mixed output? Integrating that into logistic mixing gives us a combined, purely logistic mixing and modeling approach:

    (1) each model i has a state s_i in logistic space

    (2) the mixer assigns weights w_i to the models s_i

    (3) a mixed prediction p in linear space is

    p = Sq( sum w_i s_i )

    (4) the coding error e is

    e = y-p

    (5) the weights are updated

    w_i' = w_i + L*e*s_i

    (6) the predictions are updated

    s_i' = s_i + M*e*w_i

    0 < L, M < 1 are learning rates. Note that -e*w_i is simply the partial derivative of h (the coding cost) w.r.t. an input s_i. Altogether this can be interpreted as a neural network: the input layer is made of bias neurons, with weights selected by a context. The output layer is a single neuron with linear activation and squash as the output function. The hidden layer just forwards the inputs (linear neurons). It learns using error backpropagation, which minimizes coding cost rather than RMSE.
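    A rough sketch of what I mean, in floating point for clarity (class/variable names and the learning rates are placeholders; in a real coder each s_i would be a context-selected table entry in fixed point):

    Code:
    #include <cmath>
    #include <vector>

    // Combined logistic mixing & modeling: the per-model states s_i live in
    // logistic space and are updated together with the mixer weights w_i,
    // both by gradient descent on the coding cost.
    struct LogisticCM {
      std::vector<double> s, w;   // logistic states and mixer weights
      double L = 0.02, M = 0.02;  // learning rates, 0 < L, M < 1 (placeholders)

      explicit LogisticCM(int n) : s(n, 0.0), w(n, 1.0 / n) {}

      static double Sq(double x) { return 1.0 / (1.0 + std::exp(-x)); }

      double Predict() const {    // (3)  p = Sq( sum_i w_i * s_i )
        double x = 0;
        for (size_t i = 0; i < s.size(); ++i) x += w[i] * s[i];
        return Sq(x);
      }

      void Update(int y, double p) {  // y = coded bit, p = last prediction
        double e = y - p;             // (4)  coding error
        for (size_t i = 0; i < s.size(); ++i) {
          double si = s[i], wi = w[i];
          w[i] = wi + L * e * si;     // (5)  weight update
          s[i] = si + M * e * wi;     // (6)  model update, scaled by its weight
        }
      }
    };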

    I have no idea how well this will perform in practice - but I hope that it might work as well as logistic mixing. Most of this is already covered by logistic mixing anyway; it is just an extension that integrates the modeling.

    I'll implement this when I've got some spare time.

    It might even be possible to get rather good results with *static* weights (no weight update), since the logistic model updates take the actual weight values into account.

    Comments?

  2. #2
    encode (The Founder)
    It would be great if you released an fpaq0-like compressor as a demo. Just a simple model, without templates and all those crazy constants that some authors like to embed into their programs to make the source code completely unreadable and a bit encrypted, so nobody can understand what's going on...

  3. #3
    osmanturan (Programmer)
    You were faster than me. I was thinking about removing the stretch/squash functions completely to decrease cache misses (I think the tables are big enough to pollute the cache). I thought about implementing a new counter model based on PAQ's mixer weight-update/probability-mix functions, but I got stuck on how to remove even the final squash function. How can we integrate the final logistic probability into the RC? I mean, a multiplication-free RC?

  4. #4
    toffer (Programmer)
    @Osman
    Actually, the lookup tables are rather cache-friendly, since probabilities are clustered near 0/1 in linear space - just remember the model output histograms. I made some tests and there is no difference in speed.

    @Encode
    Well, if you want to have different parameters for different models, there is no other flexible way of passing them to the models.

    I'll implement that when I've got spare time. Don't forget that I'm rather busy with my final thesis. Actually, implementing it is pretty straightforward - if you know how to update a PAQ mixer:

    e = y-p
    w_i' = w_i + L*e*s_i

    then you know how to update the models, too:

    s_i' = s_i + M*e*w_i

  5. #5
    osmanturan (Programmer)
    @Toffer:
    Yes, I remember. I meant something a bit different: the tables are big enough (2x8 KiB in total) to take up cache space (especially on AMD architectures), and there is other data which should be cached as well (i.e. SSE entries). So removing stretch/squash from the whole implementation should leave more room for better caching. Also, I'm sure a multiplication-free RC would be super fast.

    Edit: BTW, IIRC I intentionally removed the stretch/squash tables at some point prior to BIT 0.5 (maybe even earlier) to see whether they were a bottleneck, and I observed that it was faster. Yes, I know the computation becomes completely wrong that way (the output is garbage), but it proved to me that the stretch/squash functions are a slow part of the PAQ mixer.

  6. #6
    toffer (Programmer)
    I don't think that a multiplication-free RC would be possible (with enough precision). But I never investigated it.

    The cache is adaptive - and I doubt that more than a few (<10) 64-byte cache lines are actually used, since - as I said - the functions are evaluated near 0/1 in almost all cases. These locations are frequently used and will be cached; the others won't be, so the rarely used entries get pushed out. As I said, I tested this effect: there is no speed difference. At first I was pretty surprised, but looking at a model output histogram reveals why this happens.

  7. #7
    osmanturan (Programmer)
    BTW, correct me if I'm wrong:
    Initially, we have a model which holds w and s, and they are all zero at the beginning, because we need to assign p=0.5, and if we derive s from p, s = St(0.5) = 0. So here is the question: how can we update the function if the coefficients are zero!? In the PAQ mixer the s_i part is in linear space, so there was no problem (p=0.5 at the beginning there). But here it's logistic, and it's zero at the beginning.

    e = y-p
    w_i' = w_i + L*e*s_i
    s_i' = s_i + M*e*w_i

    Edit: I miswrote. I meant that the s_i part is not in linear space in the PAQ mixer either, but it comes from a linear-space source, so it can be a non-zero value.

  8. #8
    toffer (Programmer)
    That's an implementation detail.

    Only the model holds s, since it replaces the counters.

    Just set w_i(0) = w0/N, where N is the number of models; w0 can be optimized or simply set to 1.

  9. #9
    Shelwien (Administrator)
    @encode:
    > It would be great if you'll release a fpaq0-like
    > compressor as a demo.

    http://ctxmodel.net/files/mix_test_v3.rar
    Contains simple (simpler than fpaq0, imho)
    bitwise coder implementations: one based
    on my linear mixer, and one based on toffer's logistic mixer.

    > as some authors like to embed into their
    > programs to make the source code completely unreadable

    Now, look who's saying that

    -----------------

    > I was thinking about it to remove completely
    > stretch/squash functions for decreasing cache misses (I
    > think, the tables are big enough to pollute the cache).

    As to the logistic mixer, Squash() for it can be
    calculated by replacing the full table (SCALE entries) with a smaller
    table (1<<N entries) and N additional multiplications (N=5 atm).
    This can also be applied to Stretch(), but somehow I think
    that it'd lose a lot of precision after that, as atm
    the stretch table is initialized so that Squash(Stretch(i))=i.

    Also, both the squash and stretch tables can be cut in half,
    as they're symmetric.

    > But, got stuck in how to remove "even" final squash
    > function. How can we integrate final logistic probability
    > in RC? I mean, multiplaction free RC?

    Well, either with multiplications, or with lookup tables -
    there's no other way (obviously).
    In fact, there're many multiplication-free ari coders
    which all use logistic counters (state-based) and maintain
    log(range) instead of range. Afaik they all also have
    lookup tables.

    > Instead of having a counter with a state p (probability)
    > in linear space one could have a logistic counter s=St(p)
    > and update its state in logistic space.

    Now, why? It's a shamanic approach in full bloom
    In fact, I'm still waiting for an explanation of
    why the logistic mixer works at all - aside from its fixed nonlinear
    mapping, hidden in the form of the "update rate".
    I mean, we have an explanation for the linear mixer - the weight
    there is the probability of a certain submodel being optimal.
    But now, what's the interpretation of a logistic weight?
    And of logistic mixing itself?
    So, logistic mixing does work, that's nice and all.
    But we actually need to understand why it works, and based on
    which side effects - that would certainly allow us to acquire
    even better results.
    In fact, yesterday I finally tried an obvious thing -
    replacing w0 and w1 in the binary logistic mixer with
    w and 1-w and rederiving the update formula.
    And what do you think - compression suddenly improved
    from 214395 to 214027. That's beside the fact that
    all 5 probability mappings involved in the binary mixing
    and its update can be tuned separately, providing some
    additional improvement (the quoted results only use a single
    tuned mapping - the original result was 214695, btw).

    Well, anyway, I already tried replacing the linear counters
    with (0,1) mixers (improved ones), and optimizing the coder -
    current result is 218244 and it doesn't seem to go anywhere further.

    Also, there's some common sense behind the linear counters, in fact.
    If we have an order-0 (memoryless) static generator, like
    putc( '0'+(rand()>=P) )
    and it outputs N0 zeroes and N1 ones, then there are (N0+N1)!/N0!/N1!
    possible permutations of that (the number of the current permutation has
    to be encoded), and the optimal coding probability is N0/(N0+N1),
    with N0~=N*P/SCALE (the product of such intervals equals the
    number of permutations, so it's perfect).

    Then, probability-based linear counters are imprecise approximations
    of that, which also take data nonstationarity into account.
    But actually they work exactly the same as combinatoric counters
    after a certain initial period (see http://ctxmodel.net/rem.pl?-2).
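    For illustration, a throwaway sketch of that correspondence - here with a +0.5/+1 init of my own choice and a rate shrinking as 1/(N+1); a fixed rate approximates the same thing after the initial period:

    Code:
    #include <cstdio>

    // After each bit, a combinatoric counter predicts p = (N1+0.5)/N (order-0,
    // memoryless); a "linear" counter updated with rate 1/(N+1) reproduces it
    // exactly, since p' = p + (y-p)/(N+1) = (N*p + y)/(N+1).
    int main() {
      const int bits[] = {1, 0, 0, 1, 1, 1, 0, 1};
      double p = 0.5;     // linear counter state
      int N = 1, N1 = 0;  // combinatoric counts
      for (int y : bits) {
        p += (y - p) / (N + 1);                     // adaptive update
        N1 += y; ++N;
        std::printf("%f %f\n", p, (N1 + 0.5) / N);  // the two columns match
      }
      return 0;
    }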

    Well, I can't deny the possibility that for some types of data
    the logistic counters would be better (due to faster adaptation or something).
    But BWT output is certainly not that - it's as close to a memoryless order-0
    source as natural data can be. And the same applies to context histories,
    as they're fairly similar to BWT output.

  10. #10
    toffer (Programmer)
    Well this is not shamanic at all...

    Reasons for the logistic mixer to work well:

    * Stretch has high slopes near 0/1, hence the resolution increases there (quantisation function). Luckily this coincides with the model output histogram of a binary model - most predictions are made near 0/1:
    http://encode.ru/forum/attachment.ph...2&d=1220619017

    * The weight update formulation is a simplified iterative numeric minimization technique - gradient descent with a fixed step length. Thus weights are adjusted to minimize the actual coding cost.
    http://en.wikipedia.org/wiki/Gradient_descent

    The second thing - including a logistic estimation - just replaces St(p_i). The update rule I posted is again an iterative minimization of coding cost, *taking into account* the actual weight assigned to a submodel, thus using more information than a normal counter.

    As to a linear counter... indeed, an optimal (w.r.t. coding cost) estimation is P(y=1) = N1/N, assuming a memoryless stationary source, having seen N1 out of N bits being 1.

    But who tells you that the nonstationary update N1'=c*N1+y, N'=c*N+1, or p' = c*p + (1-c)*y (this is the first model in case of N->infinity, stationary state), does some kind of optimization regarding the coding cost? Isn't it "shamanic" to fix it for a nonstationary source by inserting an aging factor c?

    Well, in fact this model minimizes a nonstationary RMSE, 1/N * sum_{k=0..N-1} c^(N-k) * (y(k)-p)^2, and is not related to entropy.
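    A quick derivation of that claim, with the aging factor c as above (LaTeX notation; my own sketch):

    E_N(p) = \sum_{k=1}^{N} c^{N-k} (y_k - p)^2 ,\qquad
    \frac{dE_N}{dp} = -2 \sum_{k=1}^{N} c^{N-k} (y_k - p) = 0
    \;\Rightarrow\; p_N^* = \frac{\sum_k c^{N-k} y_k}{\sum_k c^{N-k}} = \frac{S_N}{W_N}

    With S_N = c S_{N-1} + y_N and W_N = c W_{N-1} + 1, the stationary limit is W_N \to 1/(1-c), so

    p_N^* = \frac{c S_{N-1} + y_N}{W_N} = c\, p_{N-1}^* + (1-c)\, y_N ,

    which is exactly the aged counter above.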

  11. #11
    Shelwien (Administrator)
    > Well this is not shamanic at all...

    Magic with scientific elements is still magic

    To be specific:
    1. The Log[p/(1-p)] logistic mapping is pulled out of thin air.
    I doubt you can say anything about its choice except
    that it fits your image - which is obviously not enough.
    And with a similar but different mapping being better for
    a specific application than Log[p/(1-p)], it becomes even
    harder to explain.
    2. You didn't have any suggestions as to avoiding weight overflows,
    or changing the mixing formula to improve the results.
    But using a single weight there improves both compression
    and memory usage (a single 16-bit weight is currently used),
    and it doesn't seem final either.
    3. Using numeric optimization techniques doesn't guarantee
    that the optimization target and iteration functions are
    chosen consciously. For example, having a rate coefficient
    in the logistic mixer update is a really bad idea - that is,
    same effect can be acquired in a better way.

    My point is that well-understood techniques (non-shamanic, that is)
    provide much better flexibility. Combinatoric/linear counters are
    a good example of that. They can be modified for any specific
    application, while about using paq mixer with alphabet distributions
    I hear just "it won't work". Why, it would work - once we properly
    understand how it works.

    > Reasons for the logistic mixer to work well:

    I talked about the interpretation, not the reasons.
    Like "probability p is the size of the interval in code space
    assigned to data strings starting with the current symbol", or
    "weight w in linear mixing is the probability of submodel M1
    being optimal for coding".
    We don't know anything similar about the logistic mixer for now.

    > * Stretch has high slopes near 0/1, hence the resolution
    > increases there (quantisation function). Luckily this
    > coincides with the model output histogram of a binary
    > model - most predictions are made near 0/1:

    I can just add the p'=Squash(wr*Stretch(p)) mappings (wr=constant)
    into my linear coder and it would help too. So what?
    That's not all that a paq mixer does.

    > * The weight update formulation is a simplified iterative
    > numeric minimization technique - gradient descent with a
    > fixed step length. Thus weights are adjusted to minimize
    > the actual coding cost.

    And then we have a choice of multiple possible iteration methods,
    and which one is better can be only determined by manual testing,
    and which is the best is unknown.

    > The second thing - including a logistic estimation - just
    > replaces St(p_i). The update rule I posted is again an
    > iterative minimization of coding cost, *taking into
    > account* the actual weight assigned to a submodel, thus
    > using more information than a normal counter.

    It's unknown whether it's good to use more information there.
    Anyway, for now logistic counters have worse performance.

    > But who tells you that the nonstationary update
    > N1'=c*N1+y, N'=c*N+1, or p' = c*p + (1-c)*y (this is the
    > first model in case of N->infinity, stationary state) does
    > some kind of optimization regarding the coding cost? Isn't
    > it "shamanic" to fix it for a nonstationary source by
    > inserting an aging factor c?

    1. Afair it's optimal for the weighted code length
    Sum -c^(N-i)*Log[ p_i ]

    2. It's true that this exponential weighting is arbitrary.
    Even more can be said - different weighting functions
    were tested and there were better ones.
    But here it's a matter of computation cost: these better
    functions are much more expensive, but not much better
    compression-wise, so this one is used.
    So this is not exactly shamanic, as it was consciously
    chosen after some evaluation.

    ****************

    Btw, I actually have a draft of logistic mixing interpretation.

    1. The idea is that coding cost is a linear combination
    of Log[p_i], so linear mixing of submodel inputs in
    log space might be more fitting... and adaptive.

    2. The bit coding cost after such mixing would look
    as follows:
    ((1-y)*Log[p1]+y*Log[1-p1])*(1-w) + ((1-y)*Log[p2]+y*Log[1-p2])*w
    and
    (1-y)*Log[p]+y*Log[1-p] = Log[p]-y*(Log[p]-Log[1-p]) = Log[p] - y*Log[p/(1-p)]
    so the mixing function becomes
    Log[p1]*(1-w)+Log[p2]*w - y*(Log[p1/(1-p1)]*(1-w)+Log[p2/(1-p2)]*w)

    And that right part is what my logistic mixer currently deals with,
    not the full formula.

    Incidentally, this might also solve the paq mixer incompatibility
    with non-binary alphabets.

  12. #12
    Shelwien (Administrator)
    > I can just add the p'=Squash(wr*Stretch(p)) mappings (wr=constant)
    > into my linear coder and it would help too. So what?

    Btw, my interpretation of the effect of such static mappings is
    "under-update". That is, most counters start at 0.5 and
    then shift closer to 0 or 1 after several updates.
    And as it takes quite a few updates to arrive at a stable
    probability, it's kind of reasonable that some compensating
    mapping might help.

    But this is kind of far from what one can see while looking
    at the w_i += wr*e*s_i logistic weight update.
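    For reference, the static mapping itself is just this (a quick sketch, floats for clarity):

    Code:
    #include <cmath>

    // Static compensation mapping p' = Squash(wr*Stretch(p)) with a constant wr;
    // equivalently p' = p^wr / (p^wr + (1-p)^wr).  wr>1 pushes predictions
    // further toward 0/1, wr<1 pulls them back toward 0.5.
    double Remap(double p, double wr) {
      double s = std::log(p / (1.0 - p));      // Stretch
      return 1.0 / (1.0 + std::exp(-wr * s));  // Squash
    }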

  13. #13
    toffer (Programmer)
    > 1. The Log[p/(1-p)] logistic mapping is pulled out of thin air.
    > I doubt you can say anything about its choice except
    > that it fits your image - which is obviously not enough.
    > And with a similar but different mapping being better for
    > a specific application than Log[p/(1-p)], it becomes even
    > harder to explain.

    My interpretation of Matt's design here is:

    Matt's idea was to use a NN for mixing - a design constraint. A sigmoidal function (a differentiable threshold element) as the output neuron provides weight adaptation via minimizing coding cost without a *division*. Thus it is just a modification of his previous (< PAQ6) mixers, which minimized coding cost using gradient descent, too. Now, if the output p = Sq(s) of that *nonlinear* output neuron has the meaning of a "probability", s certainly is something different. Thus one can assume Sq(s) to be the inverse of some function - St(p) in particular - which transformed probabilities into a different domain (#).

    I know that this lacks a mathematical interpretation, but to my understanding that intuition makes sense. On the other hand, I think that the actual design process was reversed: usually it would be better to analyse the inputs and build the model based on that, not the other way around.

    And I still don't know what would be wrong with knowing the reasons why it works well - even if these reasons luckily coincided with a more or less random design process.

    > 2. You didn't have any suggestions as to avoiding weight overflows,
    > or changing the mixing formula to improve the results.
    > But using a single weight there improves both compression
    > and memory usage (a single 16-bit weight is currently used),
    > and it doesn't seem final either.

    That was already done in ZPAQ. We already talked about this - I didn't reject any of these ideas. And why should I tell you how to implement it if you can do it on your own? I'm sure both of us know how this can be implemented.

    > 3. Using numeric optimization techniques doesn't guarantee
    > that the optimization target and iteration functions are
    > chosen consciously.

    Coding cost (entropy) should be a good target function. And I never said that gradient descent is a perfect minimization technique - I just said that its application here *makes sense*, since it is a technique to minimize a certain cost function. In addition, here it's easy to implement and seems to be efficient.

    > For example, having a rate coefficient
    > in the logistic mixer update is a really bad idea - that is,
    > same effect can be acquired in a better way.

    Well, as you know, mathematically it doesn't matter much whether you place it in the mixing or in the update. And I already explained why it appears in the update. The normal procedure in iterative numeric minimization of a function H(x) is to estimate a search direction d(k) and a scalar step length L(k) in each step k, starting from an initial estimate x(0). So x(k+1) = x(k) + L(k)*d(k). Having obtained a direction d(k), the (dim x)-dimensional optimization problem is reduced to a one-dimensional one, min H( x(k) + L(k)*d(k) ) w.r.t. L(k), in each step k. Thus the step lengths L(k) generally differ from step to step. A simple solution is to assume L(k) = L for all k. That's where that L comes from.

    > while about using paq mixer with alphabet distributions
    > I hear just "it won't work".

    I mentioned the reason several times: http://encode.ru/forum/attachment.ph...2&d=1220619017
    That's different for bytewise models as you know.

    > I can just add the p'=Squash(wr*Stretch(p)) mappings (wr=constant)
    > into my linear coder and it would help too. So what?
    > That's not all that a paq mixer does.

    I never said that, and I don't really know what you mean there.

    > And then we have a choice of multiple possible iteration methods,
    > and which one is better can be only determined by manual testing,
    > and which is the best is unknown.
    > Its unknown whether its good to use more information there.

    Mathematically, using more information makes sense, since you differentiate the whole system: dh/ds_i = dh/dp * dp/dx * dx/ds_i, with
    h = -ln(a + b*p)
    p = Sq(x), x = sum_i w_i*s_i
    and do not just bypass the effect of each individual component. That's the foundation of error backpropagation, too.

    And finally, citing my first post: "...I have no idea how well this will perform in practice..."

    > 1. Afair its optimal for weighted code length
    > Sum -c^(N-i)*Log[ p_i ]

    Right - I didn't know that. But in this case the minimizer of -sum_k c_k*ln(a_k + b_k*p) equals the minimizer of sum_k c_k*(y_k - p)^2, which I found pretty interesting.
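    To make that equivalence explicit (LaTeX notation; here a_k + b_k*p is the probability assigned to the actual bit y_k, i.e. p for y_k=1 and 1-p for y_k=0):

    \frac{d}{dp}\Bigl(-\sum_k c_k \ln(a_k + b_k p)\Bigr)
    = -\sum_{k:\,y_k=1}\frac{c_k}{p} + \sum_{k:\,y_k=0}\frac{c_k}{1-p} = 0
    \;\Rightarrow\; p^* = \frac{C_1}{C_0 + C_1},\qquad C_j = \sum_{k:\,y_k=j} c_k

    \frac{d}{dp}\sum_k c_k (y_k - p)^2 = -2\sum_k c_k (y_k - p) = 0
    \;\Rightarrow\; p^* = \frac{\sum_k c_k y_k}{\sum_k c_k} = \frac{C_1}{C_0 + C_1}

    So both cost functions have the same minimizer.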

    > 2. Its true that this exponential weighting is arbitrary.
    > Even more can be said - different weighting functions
    > were tested and there were better ones.
    > But here its a matter of computation cost, these better
    > functions are much more expensive, but not much better
    > compression-wise, so this one is used.

    Well, the same applies to simple gradient descent as applied here, as one possible numeric minimization method...

    ... and the Hessian is singular

    >
    > ****************
    >
    > Btw, I actually have a draft of logistic mixing interpretation.
    >
    > 1. The idea is that coding cost is a linear combination
    > of Log[p_i], so linear mixing of submodel inputs in
    > log space might be more fitting... and adaptive.
    >

    But where does Squash come from? If your answer is that it's applied to get the probabilities back from logistic space, you'd confirm (#).

  14. #14
    Gribok (Programmer)
    Quote Originally Posted by Shelwien View Post
    In fact, yesterday I finally tried an obvious thing -
    replacing w0 and w1 in the binary logistic mixer with
    w and 1-w and rederiving the update formula.
    And what do you think - compression suddenly improved
    from 214395 to 214027.
    Looks very interesting. Could you please provide details about your changes?

  15. #15
    toffer (Programmer)
    Get the inputs: s_i = St(p_i)

    Apply mixing: p = Sq((1-w)*s_0 + w*s_1) = Sq(s_0 + (s_1-s_0)*w)

    Update: w' = max(0, min(1, w + L*(y-p)*(s_1-s_0) ))
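    In code, roughly (floating point, untested sketch; the learning rate and the [0,1] clamp are placeholders):

    Code:
    #include <algorithm>
    #include <cmath>

    // Binary logistic mixer with a single weight w in [0,1].
    struct Mix2 {
      double w = 0.5;   // single weight
      double L = 0.02;  // learning rate (placeholder)

      static double St(double p) { return std::log(p / (1 - p)); }   // stretch
      static double Sq(double x) { return 1 / (1 + std::exp(-x)); }  // squash

      double Mix(double p0, double p1) {   // p0, p1: submodel probabilities of y=1
        s0 = St(p0); s1 = St(p1);
        return Sq(s0 + (s1 - s0) * w);     // = Sq((1-w)*s0 + w*s1)
      }

      void Update(int y, double p) {       // y = coded bit, p = last mixed prediction
        w += L * (y - p) * (s1 - s0);
        w = std::min(1.0, std::max(0.0, w));
      }

     private:
      double s0 = 0, s1 = 0;               // stretched inputs saved for the update
    };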

  16. #16
    Shelwien (Administrator)
    If we start with

    pm = w1*Log[p1/(1-p1)] + w2*Log[p2/(1-p2)]
    Log[1/(1+Exp[-pm])*(1-2*y)+y]
    (pm,p1,p2 are probabilities of 0, y=bit, result is code length (negative))

    it can be transformed into

    y=0: w1*Log[p1] +w2*Log[p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]
    y=1: w1*Log[1-p1]+w2*Log[1-p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]

    And then, if we convert that to probabilities:

    y=0: p1^w1*p2^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
    y=1: (1-p1)^w1*(1-p2)^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
    ((1-y)*p1^w1*p2^w2+y*(1-p1)^w1*(1-p2)^w2)/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)

    So this is what the actual mixing looks like.
    Now, I can interpret this as taking the probability of a string of N0*w1 zeroes
    encoded with the first model (using p1) and then N0*w2 zeroes encoded with the
    second model, and inferring an "average" zero probability out of it.

    pm0^N0 = p1^(N0*w1)*p2^(N0*w2)

    Then we do the same with a string of ones, and normalize the estimation into
    the same distribution, and that's the result.

    pm1^N1 = (1-p1)^(N1*w1)*(1-p2)^(N1*w2)
    pm = pm0/(pm0+pm1)

    Unfortunately, this doesn't provide any hints for improvement right away,
    as d/dw would be the same as in the "old" logistic mixer, and stretch tables
    would still be required for the update, even though in the mixing part they
    can be replaced with plain exp and log tables.

    But still, now I understand much better what happens, which should be helpful.
    Also, this representation provides an easy way to generalize the algorithm
    to larger alphabets (which might be useful for speed optimizations).

    And then, this might open a way for more complex estimators based on
    string probabilities - for example, with exponential weighting for
    the string symbols.

    Or, we can try finding an optimal parsing of already processed data
    into symbols belonging to one of submodels, and thus improve the
    weight estimation precision.
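    A quick numeric sanity check of that equivalence (a throwaway sketch, nothing from the actual coder):

    Code:
    #include <cmath>
    #include <cstdio>

    // probability-of-0 mix written both ways: squash of the weighted stretch sum,
    // and the weighted-geometric-mean form derived above
    double mix_logistic(double p1, double p2, double w1, double w2) {
      double sm = w1 * std::log(p1 / (1 - p1)) + w2 * std::log(p2 / (1 - p2));
      return 1 / (1 + std::exp(-sm));
    }
    double mix_geometric(double p1, double p2, double w1, double w2) {
      double a = std::pow(p1, w1) * std::pow(p2, w2);
      double b = std::pow(1 - p1, w1) * std::pow(1 - p2, w2);
      return a / (a + b);
    }

    int main() {
      double p1 = 0.9, p2 = 0.3, w1 = 0.7, w2 = 0.4;
      std::printf("%.12f %.12f\n", mix_logistic(p1, p2, w1, w2),
                  mix_geometric(p1, p2, w1, w2));  // prints the same value twice
      return 0;
    }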

  17. #17
    Matt Mahoney (Expert)
    Here are some results from constraining a 2-input mixer so that the weights add to 1. This adds 3 results to my previous zpaq results on book1bwt. For the simple mixing strategies, compression is slightly worse (mix2) than when the two weights can change independently. In the third case, where the models are chained and mixed again, constraining the weights helps a bit. In each of the 3 new cases I tuned the learning rate for maximum compression.

    Code:
    (results for zpaq c(this_file) x book1bwt)
    (note: parameters are tuned)
    comp 2 0 0 0 (hh hm ph pm)
                               (uncomment any of below to run)
    (235069 simple order 0)    (1 0 cm 9 3)
    (237916 simple order 1)    (2 0 const 0  1 cm 17 5)
    (220811 static mixer)      (3 0 cm 9 3   1 cm 17 5     2 avg 0 1 120)
    (219581 adaptive mixer)    (3 0 cm 9 3   1 cm 17 5     2 mix 0 0 2 14 0)
    (219858 adapt 1w mixer)    (3 0 cm 9 3   1 cm 17 5     2 mix2 0 0 1 140 0) (worse)
    
    (221070 indirect order 0)  (1 0 icm 4)
    (231883 indirect order 1)  (2 0 const 0  1 icm 12)
    (215055 indirect chained)  (2 0 icm 4    1 isse 12 0)
    (214936 ind static mixer)  (3 0 icm 4    1 icm 12      2 avg 0 1 160)
    (214514 ind adapt mixer)   (3 0 icm 4    1 icm 12      2 mix 0 0 2 14 0)
    (214809 ind adapt 1w mixer)(3 0 icm 4    1 icm 12      2 mix2 0 0 1 48 0) (worse)
    (214772 chained & mixed)   (3 0 icm 4    1 isse 12 0   2 mix 0 0 2 14 0)
    (214604 chained & 1w mixed)(3 0 icm 4    1 isse 12 0   2 mix2 0 0 1 220 0) (better)
    hcomp
      d= 1 a<<= 9 *d=a halt   (set order 1 context for model 1)
    post
      0
    end

  18. #18
    Shelwien (Administrator)
    Btw, my current results with book1bwt (using mix_test_bin):
    (which only uses straight o0 and o1 stats, not icm)
    214023 - rounding added to my binary logistic mixer implementation
    213953 - Stretch mapping in update optimized separately from mapping in mix

    Also, I noticed a small problem with C arithmetic which affects
    some logistic mixer implementations:
    (x>>c) is not the same as (x/(1<<c)) for negative x.

  19. #19
    Matt Mahoney (Expert)
    Hopefully not zpaq.
    When I write floor(x / 2^c) in the document, I mean of course (x >> c).

  20. #20
    Shelwien (Administrator)
    You'd better not mean what you said, as (int(-1)>>1)=-1.
    So yeah, zpaq, and other ones.
    Check this:
    Code:
    #include <stdio.h>
    int main( void ) { printf( "%i %i %i %i\n", -1>>1, 1>>1, (-254+2)>>2, (254+2)>>2 ); return 0; }
    
    output:
    -1 0 -63 64
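    If symmetric behaviour is wanted, the shift needs a correction for negative x - something like this (a sketch; it still relies on an arithmetic right shift for negative values, like the code above):

    Code:
    // truncate toward zero instead of flooring: for negative x, add (2^c - 1)
    // before the shift, so that -1/2 -> 0 and -252/4 -> -63, matching the
    // magnitudes of the positive case
    inline int div_pow2(int x, int c) {
      return (x + (x < 0 ? (1 << c) - 1 : 0)) >> c;
    }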

  21. #21
    Shelwien (Administrator)
    Got some interesting results in support of my substring theory.

    214023 - baseline mix2
    214395 - logistic mixer with two weights
    214999 - linear update v1
    214184 - linear update v2

    Code:
    Node2j f0;
    double cl0,cl1;
    
    void Update_v1( int y, int p0,int p1 ) {
      int z, pz=f0.P;
      if( y==0 ) {
        z = (SCALE-pz)*p1 >= pz*p0;
      } else {
        z = (SCALE-pz)*(SCALE-p1) >= pz*(SCALE-p0);
      }
      f0[ctx].Update( z, M_l1wr, M_l1mw );   
      w = SCALE-f0.P + hSCALE;
    }
    
    void Update_v2( int y, int p0,int p1 ) {
      cl0 = cl0*(SCALE-M_l0wr)/SCALE - log( float(y?SCALE-p0:p0)/SCALE );
      cl1 = cl1*(SCALE-M_l0wr)/SCALE - log( float(y?SCALE-p1:p1)/SCALE );
      int z = cl1>cl0;
      f0.Update( z, M_l1wr, M_l1mw );
      w = SCALE-f0.P + hSCALE;
    }
    So, v1 constructs the submodel (o0 and o1) switching sequence (z) based
    on comparisons of the coding cost of model selection + model output.
    But that's a greedy algorithm, so I actually expected much worse performance
    from it - it's a case similar to LZ greedy parsing.

    And v2 constructs the switching sequence based on exponentially weighted
    coding cost comparison - btw, the result becomes even better when a context
    is used for a counter there (ie for probability estimation of first submodel selection).

    Anyway, greedy switching doesn't even allow a proper initialization of the
    switching-sequence model, and the averaged coding cost comparison might be
    somewhat delayed compared to the gradient update.
    But a better parsing implementation should further improve the performance
    of this scheme. And I hope that it might be possible to reach better results
    than with gradient update using delayed counters or something like that.

    So it looks like logistic mixing is actually quite related to optimal parsing
    of processed data, and nobody noticed

  22. #22
    Shelwien (Administrator)
    Changed the switching flag definition to
    Code:
        int yp0 = y?SCALE-p0:p0;
        int yp1 = y?SCALE-p1:p1;
        int z = (cl1>cl0) || (yp0>yp1+1160);
    and now its 214010 %)
    So seems that improvements (over gradient update) are certainly possible.
    They might be much more complicated and slower though...

    And meanwhile, the current result with a separately optimized mapping
    for pm=Squash(x) (to avoid clipping or something) is 213766 %).
    So it was possible to improve it by almost 1k just by tweaking the mixer
    (compared to http://encode.ru/forum/showpost.php?p=7777&postcount=20).

  23. #23
    toffer (Programmer)
    Quote Originally Posted by Shelwien View Post
    If we start with

    pm = w1*Log[p1/(1-p1)] + w2*Log[p2/(1-p2)]
    Log[1/(1+Exp[-pm])*(1-2*y)+y]
    (pm,p1,p2 are probabilities of 0, y=bit, result is code length (negative))
    Ok, but remember to mention that you work with P(y=0). pm might be misleading, since it is not a linear probability but a logistic estimate.

    Quote Originally Posted by Shelwien View Post
    it can be transformed into

    y=0: w1*Log[p1] +w2*Log[p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]
    y=1: w1*Log[1-p1]+w2*Log[1-p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]

    And then, if we convert that to probabilities:

    y=0: p1^w1*p2^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
    y=1: (1-p1)^w1*(1-p2)^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
    ((1-y)*p1^w1*p2^w2+y*(1-p1)^w1*(1-p2)^w2)/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)

    So this is what the actual mixing looks like.
    I think "conversion" here means the reverse transform from logistic space (your pm) to linear space, right? But you "converted" using exp(x), not Squash(x); having ln(1+exp(... instead of ln(exp(... makes the transformations you did impossible. That's not the same thing, and the result doesn't have the meaning of a linear probability. The p_i in your equations are probabilities, but the whole expression is in logistic space, so the conversion should be done with Squash.

    ...
    If it was a probability, the rest would be correct, I guess.
    Quote Originally Posted by Shelwien View Post
    And meanwhile, the current result with a separately optimized mapping
    for pm=Squash(x) (to avoid clipping or something) is 213766 %).
    So it was possible to improve it by almost 1k just by tweaking the mixer
    (compared to http://encode.ru/forum/showpost.php?p=7777&postcount=20).
    The "clipping" simply increases the dynamic range of Squash. Note that this can be useful, since the derivative (slope) near the asymptot at +-1 is rather small. In NN terms that is often done, too. The gradient descent gets slowed down by the low slope. So my suggestion would be, instead of optimizing the mappings themself (their stems) or the learning rate (as we discussed) one could simply optimize the range of Squash. In the implementation i initially gave you it was from -8...8. Thus you could save stems for squash from -10...10 and optimize the range -r ... r manually. Afterwards rescale the function to map to +-1 at the edges -r, r.
    Attached image: squash.png

  24. #24
    Shelwien (Administrator)
    Log[1/(1 + Exp[-w1*Log[p1/(1 - p1)] - w2*Log[p2/(1 - p2)]])*(1 - 2*y) + y]

    then, for y=1:

    sm = w1*Log[p1/(1-p1)] + w2*Log[p2/(1-p2)]

    Exp[sm] = (p1/(1-p1))^w1 * (p2/(1-p2))^w2

    Log[1-1/(1+Exp[-sm])] = Log[1-1/(1+1/Exp[sm])] =

    Log[1-Exp[sm]/(Exp[sm]+1)] = Log[1/(Exp[sm]+1)] =

    Log[ 1 / ( 1 + (p1/(1-p1))^w1 * (p2/(1-p2))^w2 ) ] =

    Log[ 1 / ( 1 + p1^w1*p2^w2 / (1-p1)^w1 / (1-p2)^w2 ) ] =

    Log[ ( (1-p1)^w1 * (1-p2)^w2 ) / ( (1-p1)^w1*(1-p2)^w2 + p1^w1*p2^w2 ) ] =

    w1*Log[1-p1] + w2*Log[1-p2] - Log[ (1-p1)^w1*(1-p2)^w2 + p1^w1*p2^w2 ]

  25. #25
    toffer (Programmer)
    I admit you are right - I made a mistake in my calculations. Looks like I'm more error-prone than Mathematica.

    Good work.

    BTW, I found an alternate representation. Note p = P(y=1), b = 1-2*y; in the opposite case replace b with -b.

    p = [(1-p1)^(b*w1) * (1-p2)^(b*w2)] / [(1-p1)^(b*w1) * (1-p2)^(b*w2) + p1^(b*w1) * p2^(b*w2)]

    What about the clipping I mentioned?

  26. #26
    Matt Mahoney (Expert)
    Quote Originally Posted by Shelwien View Post
    You'd better not mean what you said, as (int(-1)>>1)=-1.
    floor(x) = the largest integer not greater than x, so floor(-1/2) = -1, right?

  27. #27
    Shelwien (Administrator)
    I mean, it's a wrong behaviour for symmetric values.

  28. #28
    Shelwien (Administrator)
    http://ctxmodel.net/files/mix_test/mix_test_v4.rar

    (v3)
    215703 - dual fixed-rate linear mixer
    214695 - paq-like logistic mixer
    (v4)
    213663 - logistic mix2 with gradient descent update
    213599 - logistic mix2 with likelihood-based update

    - Logistic mixer rewritten
    - A new update based on new logistic mixing interpretation tested
    - 3 mappings per coder optimized

    Update: Tested with enwik8.bwt:
    21169967 - v3/linear
    21064282 - v3/logistic
    21038280 - v4/logistic
    21065319 - v4/likelihood

    So the new mixer seems to be too tuned to book1.
    On the other hand, it's still on par with the "original" logistic mixer.

  29. #29
    Matt Mahoney (Expert)
    Quote Originally Posted by Shelwien View Post
    I mean, its a wrong behaviour for symmetric values.
    That's true - that's why in adjusting the zpaq mixer weights I use round() instead of floor(). However, in computing the mixer output prediction, floor() works just as well experimentally, so I used it to save a few instructions. In fact, for this result:

    Code:
    (219581 adaptive mixer)    3 0 cm 9 3   1 cm 17 5     2 mix 0 0 2 14 0
    I tried replacing the two instances of floor() with round() in predict() and the output was 8 bytes larger. In other words, I tried inserting "+128" below:

    Code:
            for (int j=0; j<m; ++j)
              p[i]+=(wt[j]+128>>8)*p[cp[2]+j];
            p[i]=clamp2k(p[i]+128>>8);
    Inside the loop, round() made no difference at all.

  30. #30
    Shelwien (Administrator)


    Seems like there are still things to work on, considering lookup table generation.
    Though I already tried exponential interpolation for Squash, instead of
    linear, and only managed to make compression worse... by 3 bytes.
