
Thread: Mixing strategies

  1. #1
    Member
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Mixing strategies

    Concerning the topic of "mixing" several questions occurred to me.
    (Apologies for boring you with them...)

    As far as I understood, a typical mixing strategy is as follows:
    Compute weights w_i, 1 <= i <= n, for each model m_i (with
    estimation probability p_i). Then the mix p is computed as follows:
    p = (w_1 * p_1 + ... + w_n * p_n) / (w_1 + ... + w_n).

    *** Question 1: Do negative weights make sense (and if so, in which cases)? I do not see any good "physical" interpretation of negative weights but I remember once having seen negative weights when it comes to mixing with the order -1 model. Maybe this is just a mathematical "trick" to increase/decrease probabilities beyond the restriction in question 2.

    *** Questions 2+3: If the weights satisfy w_i >= 0, then we do have the restriction
    min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
    Is this a "severe" restriction in practice? Below, I try to explain what I mean by "severe".
    Does neural network mixing (e.g. as in paq) also have this restriction (since it uses non-linear mixing)?

    *** Question 4: Is there some public state-of-the-art computation of the weights w_i? Which one do you use?

    *** Question 5: Are there any other non-linear mixing strategies used (which and where?)?

Concerning question 2: for simplicity, suppose you are given two probability estimations p_1 and p_2 (and of course, you know the input so far and what the intermediate estimations of the two models were). To me it seems that the "best" possible mix p (derived from the two models) does not need to satisfy the restriction in question 2, but of course, I do not have a proof -- since it would depend on the models. Can you think of any cases in which the best mix obviously does not satisfy the restriction (i.e., in which the restriction is "severe") or do you think that such cases cannot exist (i.e. the best possible mix has to lie between p_1 and p_2 if you have no further knowledge)?

    Best,
    Alibert

  2. #2
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Alibert View Post
    Concerning the topic of "mixing" several questions occurred to me.
    (Apologies for boring you with them...)
I like this topic. I hope the discussion becomes hotter

    Quote Originally Posted by Alibert View Post
    As far as I understood, a typical mixing strategy is as follows:
    Compute weights w_i, 1 <= i <= n, for each model m_i (with
    estimation probability p_i). Then the mix p is computed as follows:
    p = (w_1 * p_1 + ... + w_n * p_n) / (w_1 + ... + w_n).
Yes. This is the well-known linear variant of mixing. AFAIK, it was used prior to PAQ7. The main disadvantage is the division (=slow).
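A minimal sketch of that formula, just as an illustration (my own code, not from any particular compressor; plain doubles, made-up function name):
Code:
// Illustration of linear mixing: p = (w_1*p_1 + ... + w_n*p_n) / (w_1 + ... + w_n),
// assuming non-negative weights. The final division is the slow part.
#include <cstddef>
#include <vector>

double LinearMix(const std::vector<double>& p, const std::vector<double>& w) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        num += w[i] * p[i];   // weighted sum of the model predictions
        den += w[i];          // normalization term
    }
    return num / den;         // stays within [min(p_i), max(p_i)] for w_i >= 0
}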

    Quote Originally Posted by Alibert View Post
    *** Question 1: Do negative weights make sense (and if so, in which cases)? I do not see any good "physical" interpretation of negative weights but I remember once having seen negative weights when it comes to mixing with the order -1 model. Maybe this is just a mathematical "trick" to increase/decrease probabilities beyond the restriction in question 2.
I can't see any sense behind negative weights. If this is allowed in the above equation, then "p" can be negative too under certain circumstances. And that is out of range for a probability in the mathematical sense.

    Quote Originally Posted by Alibert View Post
    *** Questions 2+3: If the weights satisfy w_i >= 0, then we do have the restriction
    min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
    Is this a "severe" restriction in practice? Below, I try to explain what I mean by "severe".
    Does neural network mixing (e.g. as in paq) also have this restriction (since it uses non-linear mixing)?
The final mixed prediction is usually better than the "best" model's prediction. So this restriction is not on the "must have" list. The same is also true for PAQ.

    Quote Originally Posted by Alibert View Post
    *** Question 4: Is there some public state-of-the-art computation of the weights w_i? Which one do you use?
For me, an ideal weighting scheme should include the real coding cost implicitly or explicitly. For example, PAQ's neural network mixing uses the real coding cost (though it's still not ideal because of RC redundancy). PAQ's mixer is good enough to eliminate non-optimal models' predictions. So, if you feel your models don't output accurate predictions, then you can try the PAQ mixer. But if you think your modeling strategies are near-optimal, you can use some weaker but faster methods. Also, an SSE-like approach can be used for mixing too. Please have a look here for a fast method: http://encode.ru/forum/showpost.php?p=2855&postcount=6

    Quote Originally Posted by Alibert View Post
    *** Question 5: Are there any other non-linear mixing strategies used (which and where?)?
If you find any, please let me know

    Quote Originally Posted by Alibert View Post
Concerning question 2: for simplicity, suppose you are given two probability estimations p_1 and p_2 (and of course, you know the input so far and what the intermediate estimations of the two models were). To me it seems that the "best" possible mix p (derived from the two models) does not need to satisfy the restriction in question 2, but of course, I do not have a proof -- since it would depend on the models. Can you think of any cases in which the best mix obviously does not satisfy the restriction (i.e., in which the restriction is "severe") or do you think that such cases cannot exist (i.e. the best possible mix has to lie between p_1 and p_2 if you have no further knowledge)?
As I said before, we don't have any restriction. I think you should think of mixing as just another model which predicts the next symbol. Actually, this is true: "mixing" works on model predictions, while "models" work on real symbols or intermediate symbols (=transformed symbols). So mixing is kind of a completely different model. But again, there is an implicit correlation to the input symbols in any case (else it would be useless).

    Best Regards
    Osman
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    Member
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by osmanturan View Post
I like this topic. I hope the discussion becomes hotter
    Me too!
    Thank you for the quick reply.

I can't see any sense behind negative weights. If this is allowed in the above equation, then "p" can be negative too under certain circumstances. And that is out of range for a probability in the mathematical sense.
    Of course, if an "out-of-range" is met, then one has to execute something like p := MIN_PROBABILITY or p := MAX_PROBABILITY. Now, don't get me wrong: of course this won't be very practical; but still two questions remain: can it be useful (i.e. enhance compression ratio)? And: is there some "physical" interpretation of negative weights.

The final mixed prediction is usually better than the "best" model's prediction. So this restriction is not on the "must have" list.
    By using linear mixing with positive weights, this restriction comes automatically (even if you do not want to have it). Do you think that this might be the main disadvantage over PAQ's non-linear mixing?

    ... because of RC redundancy...
    Sorry: what does "RC redundancy" stand for?

Also, an SSE-like approach can be used for mixing too.
    Sorry again for my little knowledge: could you give some more details on this approach?

Please have a look here for a fast method: http://encode.ru/forum/showpost.php?p=2855&postcount=6
    Thank you for the link. If I understand correctly, this method is only capable of mixing two models at a time, i.e., mixing of more than two models is done by successively mixing two models. So, even if the mixing algorithm is nearly optimal when it comes to mixing two models, it is far from obvious that this is a good approach for mixing more than two models. Do you (or anybody else) have any more insight in mixing more than two models (and why the successive mixing of two models still performs quite well)?

If you find any, please let me know
    I am working on it!

    Best regards,
    Alibert

  4. #4
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts

    Hi!

I believe that all these very interesting mathematical rules are too complicated, and that the better solution for compression is instead to be found in the simple approach!

  5. #5
    Member
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Nania Francesco View Post
I believe that all these very interesting mathematical rules are too complicated, and that the better solution for compression is instead to be found in the simple approach!
Maybe mathematics complicates things at first, but I think that at some point both insight and accuracy always pay off (for example, consider PPM vs. PPMII).

  6. #6
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Alibert View Post
    Me too!
    Thank you for the quick reply.
    You are welcome

    Quote Originally Posted by Alibert View Post
    Of course, if an "out-of-range" is met, then one has to execute something like p := MIN_PROBABILITY or p := MAX_PROBABILITY. Now, don't get me wrong: of course this won't be very practical; but still two questions remain: can it be useful (i.e. enhance compression ratio)? And: is there some "physical" interpretation of negative weights.
Yes, clipping is another option, but I still can't see any benefit of negative weights in linear mixing as you described (of course, I can be wrong too). Because if there were one, then a model would bias the other models by its negative weight. Strange behaviour...

    Quote Originally Posted by Alibert View Post
    By using linear mixing with positive weights, this restriction comes automatically (even if you do not want to have it). Do you think that this might be the main disadvantage over PAQ's non-linear mixing?
Yes, this restriction comes with it. But the main advantage of PAQ's mixing is using near-real coding cost as the weighting scheme. For me, the main problem in designing a completely new mixer is finding a good and fast weighting method (the mixing stage relies on weighting).

    Quote Originally Posted by Alibert View Post
    Sorry: what does "RC redundancy" stand for?
Sorry for the acronym. RC is range coder. Let me explain "RC redundancy": if we have a prediction P for the next bit, then in theory we need -log2(P) bits for it. But in practice, arithmetic coders and range coders have some redundancy (usually very small - e.g. on SFC, PAQ's RC has ~2000 bytes of redundancy). So a practical range coder tends to consume a bit more than the theoretical value. And this is generally known as "RC redundancy".

    Quote Originally Posted by Alibert View Post
    Sorry again for my little knowledge: could you give some more details on this approach?
In classical Markovian modeling, we try to predict symbols under some contexts by producing probabilities for them. Secondary Symbol Estimation, or SSE for short, takes this approach a bit further: it maps a context and probability pair to a new probability with a slowly learning counter. In an electronics sense, it's a kind of low-pass filter with an extra helpful input (here, another context). There is a good and short explanation here: http://encode.ru/forums/index.php?ac...page=5#msg9819
You can also have a look into the PAQ3 source code.
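To make the idea concrete, here is a minimal sketch of such an SSE stage (my own illustration, not PAQ's actual code; the table size, quantization and learning rate are arbitrary assumptions):
Code:
// Hypothetical SSE sketch: map (context, quantized input probability) to an
// adjusted probability; each cell is a slowly learning counter.
struct SSE {
    static const int kBuckets = 33;      // probability quantization buckets (arbitrary)
    double table[256][kBuckets];         // assumes an 8-bit context (arbitrary)
    SSE() {
        for (int c = 0; c < 256; ++c)
            for (int q = 0; q < kBuckets; ++q)
                table[c][q] = double(q) / (kBuckets - 1);   // start as a pass-through map
    }
    int Quantize(double p) const { return int(p * (kBuckets - 1) + 0.5); }
    double P(int ctx, double p) const { return table[ctx][Quantize(p)]; }
    void Update(int ctx, double p, int y, double rate = 0.02) {
        double& cell = table[ctx][Quantize(p)];
        cell += rate * (y - cell);       // slowly move the cell toward the coded bit y
    }
};
A real implementation would typically also interpolate between neighbouring buckets and keep the output clamped away from 0 and 1.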

    Quote Originally Posted by Alibert View Post
    Thank you for the link. If I understand correctly, this method is only capable of mixing two models at a time, i.e., mixing of more than two models is done by successively mixing two models. So, even if the mixing algorithm is nearly optimal when it comes to mixing two models, it is far from obvious that this is a good approach for mixing more than two models. Do you (or anybody else) have any more insight in mixing more than two models (and why the successive mixing of two models still performs quite well)?
Yes. It's only for two models at a time. But it has its own advantages. For example, you can cut down the mixing tree at some points under some circumstances. I was trying to make a multi-level mixer which relies on near-real coding cost, but in the end I had lots of very complex computations. Lookups could be used here, but then it would be equivalent to PAQ's mixer.

    Quote Originally Posted by Alibert View Post
    I am working on it!
    If you make some progress, please let me know.

    Best Regards
    Osman
    BIT Archiver homepage: www.osmanturan.com

  7. #7
    Member
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by osmanturan View Post
In classical Markovian modeling, we try to predict symbols under some contexts by producing probabilities for them. Secondary Symbol Estimation, or SSE for short, takes this approach a bit further: it maps a context and probability pair to a new probability with a slowly learning counter. In an electronics sense, it's a kind of low-pass filter with an extra helpful input (here, another context). There is a good and short explanation here: http://encode.ru/forums/index.php?ac...page=5#msg9819
You can also have a look into the PAQ3 source code.
    If I understand things correctly, using SSE for mixing works by using the probabilities of the different models (maybe with additional context) as inputs. I suppose the "practical" approach again is successively mixing two models. Has anybody tried this approach with more than two models? Also, I do not see how interpolation works with more than two models.

It seems that neural networks are the only (??) approach which is used in practice and which is capable of mixing more than two models at once. It seems that all other (heuristic?) linear approaches can be improved by using the binary linear mixing algorithm explained earlier. Am I right?

    Best,
    Alibert

  8. #8
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Alibert View Post
    If I understand things correctly, using SSE for mixing works by using the probabilities of the different models (maybe with additional context) as inputs.
Actually, SSE takes one probability and one context and outputs one probability at a time. But in a mixing sense, you can extend this idea a bit further. Instead of using a single probability as input, you can design it as a 2D map of probabilities (together with the context it's actually a volume - 3D). You should keep it as small as possible, because otherwise small-cache systems suffer from this design. So, you have to use some quantization. For example:
    Code:
q1 = Quantize(p1);                    // quantize both input probabilities
q2 = Quantize(p2);
activeModel = &SSE[context][q1][q2];  // select the counter for (context, q1, q2)
p = activeModel->P();                 // adjusted (mixed) probability
...                                   // code the bit y using p
activeModel->Update(y);               // slowly adapt the counter toward y
Here, p1 and p2 are the input probabilities, and "Quantize" is a quantization function (if you use a lookup table for it, you have to keep that as small as possible too). q1 and q2 are the quantized probabilities, "activeModel" is a pointer to a counter model, SSE is an array of counter models, "p" is the adjusted probability, and "y" is the just-coded bit. I think it's clear enough.

    Quote Originally Posted by Alibert View Post
    I suppose the "practical" approach again is successively mixing two models. Has anybody tried this approach with more than two models? Also, I do not see how interpolation works with more than two models.
Yep. Both Shelwien and I tried "successive mixing of two models", and probably toffer tried that idea as well. PAQ9A's design also relies on this mixing scheme. You can use several combinations, for example:
    p=MIX(p0, MIX(p1, MIX(p2, p3)));
    or
    p=MIX(MIX(p0, p1), MIX(p2, p3));
    etc.
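A sketch of one such two-input node and how the combinations above compose into a tree (my own illustration; for simplicity the weight update here descends squared error rather than coding cost, and the learning rate is arbitrary):
Code:
// Hypothetical 2-input adaptive linear mixer: p = w*p1 + (1-w)*p2.
struct Mix2 {
    double w, p1, p2, p;
    Mix2() : w(0.5), p1(0.5), p2(0.5), p(0.5) {}
    double Mix(double a, double b) {
        p1 = a; p2 = b;
        p = w * p1 + (1 - w) * p2;        // stays between p1 and p2
        return p;
    }
    void Update(int y, double rate = 0.02) {
        w += rate * (y - p) * (p1 - p2);  // shift weight toward the better input
        if (w < 0) w = 0;
        if (w > 1) w = 1;
    }
};

// Tree composition, e.g. p = MIX(MIX(p0,p1), MIX(p2,p3)):
//   Mix2 m01, m23, root;
//   double p = root.Mix(m01.Mix(p0, p1), m23.Mix(p2, p3));
//   ... code bit y with p, then: m01.Update(y); m23.Update(y); root.Update(y);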

    Quote Originally Posted by Alibert View Post
It seems that neural networks are the only (??) approach which is used in practice and which is capable of mixing more than two models at once. It seems that all other (heuristic?) linear approaches can be improved by using the binary linear mixing algorithm explained earlier. Am I right?
PAQ's mixer is widely known just because of PAQ's popularity. I'm sure there are other ways to do multi-input mixing. But because of the complexity, we again need lookup tables, and as I pointed out before, there is a high chance of ending up with an equivalent code path.

    I think, first of all you should decide one thing: near-ideal mixing (slow) or practical mixing (fast)?

If you want to go with practical mixing, you should keep in mind that common modeling techniques tend to produce probabilities near 0/1. So you can discard some things, as toffer pointed out. And you can have a look at a proof of this in this post: http://encode.ru/forum/showpost.php?p=4282&postcount=65

For motivation, you can build a very simple coder which mixes two models, and you can then improve it. Maybe you will find another good mixing scheme in the end. Good luck. I look forward to hearing about any progress.
    BIT Archiver homepage: www.osmanturan.com

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > *** Question 1: Do negative weights make sense (and if so,
    > in which cases)? I do not see any good "physical"
    > interpretation of negative weights but I remember once
    > having seen negative weights when it comes to mixing with
    > the order -1 model. Maybe this is just a mathematical
    > "trick" to increase/decrease probabilities beyond the
    > restriction in question 2.

    An example where negative weights would be useful:
    Suppose that we have an "inverted" binary model, which
    gives a probability-of-0 estimation 0.0 when next bit=0
    and 1.0 when next bit=1.
    Then a perfect weighting scheme would be
    p_1 = 0.5
    p_2 = inverted estimation
    p = ( 2*p_1 + (-1)*p_2 ) / ( 2 + (-1) ) = 1-p_2
And in fact, it's quite a practical case, as with
context sequences similar to 01010101 most common
estimators would be one step behind.
But imho it's better to handle such cases with specific
models, or at least by assigning separate weights to (1-p_i)
values, instead of allowing negative weights.

    > min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
    > Is this a "severe" restriction in practice?

    This restriction might really cause a redundancy in some cases.
But it's really easy to fix - by including p_0=0 and p_n+1=1.

    > *** Question 4: Is there some public state-of-the-art computation of the weights w_i?
    > Which one do you use?

Aside from straightforward ad hoc models like ppmy and ash (which directly
use the weighting scheme you described) and the paq versions, it's hard to name something,
as it's a "slow" approach.
Also, paq's weight update function won't be particularly useful in linear space.
So it's kinda hard to tell what's "state-of-the-art" here.
But in fact it's possible to derive a weight distribution from the probability distributions
given by any practical PPM or CM models (and even LZ, if you really try).

    Anyway, there's certainly no popular data type produced by a generator based on
    a linear vector mixing model, so there's not much reason in finding a "perfect"
    weight estimation method for such a scheme.
    I can give you a possible candidate, though - after processing each symbol you
    can perform a weight optimization for minimal compressed size of already processed
    data (using GA or any other optimization algorithm).

    > *** Question 5: Are there any other non-linear mixing strategies used (which and where?)?

Not really, as there's not much "physical sense" in that (you can't explain how even
paq's model could find a perfect asymptotic approximation for simple source models like:
b1 = (random(0..1)>=p1);
b2 = (random(0..1)>=p2);
output( random(0..1)>=p3 ? b1 : b2 );
though with linear weighting it would be possible:
p = (1-p3)*p1 + p3*p2).
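A tiny Monte Carlo check of that claim (my own sketch; the parameters are arbitrary, and p1, p2, p3 are probability-of-0 values as in the estimators above):
Code:
// The switching source's true probability of a 0 bit is exactly the
// linear mix (1-p3)*p1 + p3*p2, with no adaptation needed.
#include <cstdio>
#include <cstdlib>

int main() {
    const double p1 = 0.9, p2 = 0.2, p3 = 0.3;       // arbitrary example parameters
    const int N = 1000000;
    int zeros = 0;
    srand(1);
    for (int i = 0; i < N; ++i) {
        double r = double(rand()) / RAND_MAX;        // switching control stream
        int b1 = double(rand()) / RAND_MAX >= p1;    // submodel 1: P(bit=0) = p1
        int b2 = double(rand()) / RAND_MAX >= p2;    // submodel 2: P(bit=0) = p2
        int bit = (r >= p3) ? b1 : b2;               // b1 with prob 1-p3, b2 with prob p3
        if (bit == 0) ++zeros;
    }
    printf("empirical P(0) = %f   linear mix = %f\n",
           double(zeros) / N, (1 - p3) * p1 + p3 * p2);
    return 0;
}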

    But, afair, CTW's mixing is non-linear as well.
    Also, any PPMs (due to symbol masking) and any coders with secondary estimation.

  10. #10
    Member
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    > *** Question 1: Do negative weights make sense?

    An example where negative weights would be useful...
Thank you for your simple yet enlightening example. As far as I understand, using negative weights can eliminate the "bad influence" of some models on others; of course, the main problem is identifying such correlations (both qualitatively and quantitatively).

    But imho its better to handle such cases by specific
    models, or at least by assigning separate weights to (1-p_i)
    values, instead of allowing negative weights.
    Could it be that separate weights for p_i and (1-p_i) yield inconsistency (i.e. weighted probabilities do not sum up to 1)?

    > min(p_1, ..., p_n) <= p <= max(p_1, ..., p_n).
    > Is this a "severe" restriction in practice?
    This restriction might really cause a redundancy in some cases.
But it's really easy to fix - by including p_0=0 and p_n+1=1.
    I suppose one has to use fixed weights w_0 and w_n+1, since adaptive weights in most settings will tend to 0. On the other hand, in linear mixing this will simply yield the addition of a number which only depends on the other weights. This looks strange... (maybe I am just a little bit confused ).

    I am still unsure whether PAQ's neural network mixing does have the same restriction (which of course could be solved using your p_0 / p_n+1 approach). To me, it appears as if SSE is the only approach (which is currently applied) to get rid of the above restriction.

    <mixing> is a "slow" approach.
    Do you also consider toffer's 0/1 modification as a slow approach? (This is nothing personal; to me toffer's approach seems very efficient and I am curious about your opinion on what to consider "slow".) Maybe the "slowness" comes from the multiple models (which in a lot of cases are doing rather similar things; e.g. Markov models of different orders, use of determinism in sources by applying KT-estimators and match predictors,...).

    Anyway, there's certainly no popular data type produced by a generator based on
    a linear vector mixing model, so there's not much reason in finding a "perfect"
    weight estimation method for such a scheme.
    Sorry, what do you mean by "generator"? I am afraid, I cannot follow your statement. Could you please explain it a little further?

    I can give you a possible candidate, though - after processing each symbol you
    can perform a weight optimization for minimal compressed size of already processed
    data (using GA or any other optimization algorithm).
    This does not seem to be very "practical" but on the other hand, it surely would be interesting for comparisons. Thank you for sharing!

    Best,
    Alibert

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Could it be that separate weights for p_i and (1-p_i)
    > yield inconsistency (i.e. weighted probabilities do not
    > sum up to 1)?

    I mean like (w1*p1+w2*(1-p1))/(w1+w2)... so no inconsistency.

    > I suppose one has to use fixed weights w_0 and w_n+1,
    > since adaptive weights in most settings will tend to 0.

I don't think so. Also, anyway, if they were really adaptive,
then they would also adapt specifically to the cases where
the optimal p is out of the min/max bounds.

    > Do you also consider toffer's 0/1 modification as a slow
    > approach?

There's PPM, which attempts to encode most of the
symbols using the probability directly taken from the statistics
of a single context.
Compared to PPM, mixing is always slower, even if it's
static mixing (that is, a speed-optimized PPM would be
faster than a CM with the same compression level, but
the reverse is not true).

    > (This is nothing personal; to me toffer's approach seems

    Actually there were people who used such things
    like 10 years ago (Serge Osnach, for example)

    Also, here's an original post on that kind of mixer:
    http://ctxmodel.net/rem.pl?-12
    (It should be somewhere in the old forum too).

    > very efficient and I am curious about your opinion on what
    > to consider "slow".)

I don't really care about slowness if it's computable.
In fact, I don't care about practical compression either.
For me, compression is just an objective measure of
a model's performance.

    > Maybe the "slowness" comes from the multiple models

    Well, that too (as CM commonly updates all the submodels
    while PPM doesn't need to update or even access them).
    But also adaptive mixing involves multiplication, and
    that's not exactly fast.

    Just for reference, it seems that people only like
    compressors with >2Mb/s speed (for practical use).

    > Sorry, what do you mean by "generator"? I am afraid, I
    > cannot follow your statement. Could you please explain it
    > a little further?

    Ah, well.
    There's a common "magical" approach to compression, where
    people try to invent a simple universal formula which would solve
    all the problems.
But in fact it's kinda different.
    For each data sample, we have a source, which generates it by
    a certain algorithm (http://cs.fit.edu/~mmahoney/compression/uiq/ is related).
    So an ideal compressor should find the best fitting parameters for that algorithm
    by given data.
    But commonly the source algorithm is either too complex (like with executables
    compiled by gcc), or unknown ("natural" text).
    So we can only build very approximate models, based on visible dependencies in
    the data. And what do we do when there're different kinds of dependencies in
    different parts of data? If we can't clearly differentiate the type of data,
    then we can only assign probabilities to these dependencies, but to avoid redundancy
    we have to integrate the symbol intervals from different submodels, and it
    becomes a linear mixing scheme. And if there're some conditions affecting
the submodels' probabilities, that becomes your weighting function.
    But submodels are important here, not the "universal" mixing.
    There's no reason to improve the mixing, while trying to compress audio data
    using text-oriented submodels.

    > > I can give you a possible candidate, though - after processing each symbol you
    > > can perform a weight optimization for minimal compressed size of already processed
    > > data (using GA or any other optimization algorithm).
    >
    > This does not seem to be very "practical" but on the other
    > hand, it surely would be interesting for comparisons.

    Well, it is practical enough. Once I spent some time developing a
    series of mixer implementations (mostly) based on something called BFA
    (Bruteforce Adaptation). The initial idea was that we can restrict
    the weight value set and accumulate the code length for all possible
values, but there turned out to be a lot of possible optimizations.

  12. #12
    Member
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    The following seems to be a new mixing strategy (beta weighting):

    http://www.informatik.uni-stuttgart....kbf09wic30.pdf

    Does anybody know how it compares with

    http://encode.ru/forum/showpost.php?p=2855&postcount=6

    (speed/compression ratio)?

    Cheers,
    Alibert

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
I'm very glad to see that Edgar Binder is still around and working on data compression,
so thanks for posting the link.

But as to the mixer itself, it seems to be pretty much Bayesian, so nothing new.
Of course, that doesn't mean that the proposed scheme is bad or doesn't work.
But there are still some flaws:
1. It's slow and there are a lot of precision-related troubles.
2. It's best for generators based on static order0-controlled
switching (not mixing) of stationary sources, and such "natural"
generators basically don't exist.
3. It's not so visible in the form of bpc, but the best listed result
for book1 seems to be 2.249/8*768771=216121 (with 25 mixed
submodels), while http://ctxmodel.net/files/tree_v1.rar produces
215719 with up to 17 submodels and static weighting.

    Thus, I can't say that practical implementations of this method
    are impossible, but they might appear surprisingly similar to some other
    methods.

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Interesting paper. But the calculation of p in section 3.1 immediately shows why the results are poor. It is the wrong way to mix probabilities no matter what weights you use. You can't just average them. You need to also consider the confidence of each model.

    Suppose you have two models that predict that the next bit will be a 1 with probabilities 0.5 and 0.98. The first model doesn't know, but the second is highly confident that the next bit will be 1. Therefore the second model should get greater weight. Linear mixing would predict 0.74, but a better prediction would probably be 0.9. Depending on weights to handle this won't work, because confidences depend on context which can change rapidly from one bit to the next.

Both paq6 and paq8 use confidences in addition to adjusting model weights. In paq6, each model outputs a count of zeros and ones which are weighted and summed. A prediction near 0 or 1 requires one of the two counts to be large, which means that the sum of both counts will also be larger than if the prediction is near 0.5. paq8 models output probabilities, but they are first stretched (p -> log(p/(1-p))) before mixing and squashed (the inverse of stretch) afterward. Stretching gives large absolute values when p is near 0 or 1. A prediction of p = 0.5 stretches to 0 and so contributes nothing.
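For illustration, a minimal floating-point sketch of that stretch/squash mixing (not the fixed-point tables the real paq8/zpaq code uses):
Code:
// Stretch the inputs, take a weighted sum, squash the result.
#include <cmath>
#include <cstddef>
#include <vector>

double Stretch(double p) { return std::log(p / (1 - p)); }      // (0,1) -> (-inf,inf)
double Squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); } // inverse of Stretch

double LogisticMix(const std::vector<double>& p, const std::vector<double>& w) {
    double x = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        x += w[i] * Stretch(p[i]);   // a p_i near 0.5 stretches to ~0 and contributes little
    return Squash(x);
}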

    It is not clear from the paper just what compression algorithm they are using, but they mention that using a separate weight for each model state improves compression but they don't know why. Well, I will tell you why. Because probability and confidence both depend on state. If a prediction is near 0 or 1, it will usually be right, so those weights will grow.

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    http://ctxmodel.net/files/mix_test_v0.rar

    Simple bytewise coding demo (o0/o1) for mixer comparison

    Code:
    book1bwt
    768771 - Uncompressed
    323522 - Plain order0 
    282728 - Plain order1
    273999 - Static order0+order1 mix (optimized)
271401 - Adaptive order0+order1 mix (optimized)
    I'd like someone (preferably Matt) to attach the paq mixer to
    this sample, to practically demonstrate how superior it is.

    Statistics are frequency-based there, like in old compressors,
    and are relatively unfit for BWT data.
    But there's an improvement from using static and then adaptive
    mixing, so a superior universal mixer should provide even better
    results without changing the inputs.

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
At toffer's request, here are some versions more compatible
with the paq mixer... apparently it only works with binary
distributions or something.

    http://ctxmodel.net/files/mix_test_v1.rar
    Code:
    323043 - Plain order0 
    282051 - Plain order1
    273806 - Static order0+order1 mix
    271539 - Adaptive order0+order1 mix
    
    v1
     - Probabilities accessible at mixing stage
     - Reoptimized to probabilities (slightly worse results)
    http://ctxmodel.net/files/mix_test_v2_bin.rar
    Code:
    235937 - Plain order0 
    235584 - Plain order1
    216816 - Static order0+order1 mix
    216098 - Adaptive order0+order1 mix
    
    v2 
     - Alternate bitwise model
     - Reoptimized
Still, I'd really prefer to get a mixer improvement for v0/v1 than
for v2 (as it's less trivial) - it's exactly the mixer's purpose to
get all the possible use out of any submodels, even bad ones.
But being able to evaluate v2 with the paq mixer would be helpful too.

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Interesting paper. But the calculation of p in section 3.1
    > immediately shows why the results are poor.

Yeah... Along with Bayesian statistics as a whole, I guess.

    > It is the wrong way to mix probabilities no matter what
    > weights you use. You can't just average them.

    1. Static mixing is fast. And simple adaptive linear
    mixing is fast enough too.
    2. Linear mixing is the obvious solution for generators
with source model switching. The mixer is the model of
the source-switching control stream.

I really wonder what your theory is about degenerate cases,
like all submodel inputs being static.
But to me it seems that the paq mixer output would converge
really fast and wouldn't adapt to the data at all in such a case.

    > You need to also consider the confidence of each model.

Sure, it's very helpful.
But first it's necessary to define that "confidence".
    It might be relatively clear for high orders which
    don't have much statistics, but not all models are
    like that. And including some ad hoc value into
    the probability estimation would usually hurt the compression.

    > Suppose you have two models that predict that the next bit
    > will be a 1 with probabilities 0.5 and 0.98. The first
    > model doesn't know, but the second is highly confident
    > that the next bit will be 1. Therefore the second model
    > should get greater weight.

    To me it sounds totally wrong.
That is, if we'd decide that based on the numbers of model context
occurrences (or some smarter things), it might be ok.
And it might also be reasonable for some specific compressor
implementations, where the probability is initialized to 0.5,
so that a ~0.5 estimation would likely mean an "uneducated"
model.
But that's totally implementation specific, as it's obvious
that there might be completely random bits, with their
locations predictable with 100% confidence.

    > Both paq6 and paq8 use confidences in addition to
    > adjusting model weights. In paq6, each model outputs a
    > count of zeros and ones which are weighted and summed.

    Now, that's really useful, but fast compressors might
    not be able to afford that.

    > It is not clear from the paper just what compression
    > algorithm they are using,

It's described in section 4 - DMC and a bitwise match model.

    > Because probability and confidence both depend on state.
    > If a prediction is near 0 or 1, it will usually be right,
    > so those weights will grow.

Afaik it's better to decide that based on secondary statistics.

  18. #18
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Here are some results for book1bwt with zpaq.

    Code:
    (results for zpaq c(this_file) archive book1bwt)
    (note: all parameters are tuned)
    comp 2 0 0 0 (hh hm ph pm)
                              (uncomment any of below to run)
    (235069 simple order 0)   (1 0 cm 9 3)
    (237916 simple order 1)   (2 0 const 0  1 cm 17 5)
    (220811 static mixer)     (3 0 cm 9 3   1 cm 17 5     2 avg 0 1 120)
    (219581 adaptive mixer)   (3 0 cm 9 3   1 cm 17 5     2 mix 0 0 2 14 0)
    
    (221070 indirect order 0) (1 0 icm 4)
    (231883 indirect order 1) (2 0 const 0  1 icm 12)
    (215055 indirect chained) (2 0 icm 4    1 isse 12 0)
    (214936 ind static mixer) (3 0 icm 4    1 icm 12      2 avg 0 1 160)
    (214514 ind adapt mixer)  (3 0 icm 4    1 icm 12      2 mix 0 0 2 14 0)
    (214772 chained & mixed)  (3 0 icm 4    1 isse 12 0   2 mix 0 0 2 14 0)
    hcomp
      d= 1 a<<= 9 *d=a halt   (set order 1 context for model 1)
    post
      0
    end
    Note the const models don't do anything. I put them there so I didn't have to rewrite the hcomp code.

    To keep it simple, contexts are direct lookups (or equivalently no hash collisions) and there is no mixer context. I tuned static mixer weights (avg), adaptive mixer learning rates (mix) and simple context model minimum adaption rate (cm) for max compression.

    I could reduce sizes by 20 bytes by using b command instead of c (to omit SHA1 checksum).

As you might know, zpaq does all mixing (static or adaptive) in the logistic domain (stretch(p) = log(p/(1-p))) with a final squash at the end (squash(p) = 1/(1+exp(-p))).

    Direct models (cm) map a context to a prediction and use a variable adaption rate that starts fast and depends on a count. Indirect models (icm) map a context to a bit history and then to a prediction which is adapted at a fixed rate.

    Chaining (isse) maps a context to a bit history which selects a mixer context for the previous model and a fixed constant.

    http://cs.fit.edu/~mmahoney/compression/#zpaq

  19. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
Well, thanks for the information.
But it's kinda unrelated to what I was talking about.
That is, it doesn't prove that the logistic mixer is better than the linear one or anything.

Fortunately, it seems that toffer did the thing I asked about instead.
I'll look at it, maybe tune the parameters, and post.

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    http://ctxmodel.net/files/mix_test_v3.rar

    So here we have a proof that a paq-like mixer
    is better than the one from "sh_mixer.inc".
But now the question is whether it's possible
to design a linear mixer equivalent to the logistic one,
with "confidence".

    Code:
    215703 - dual fixed-rate linear mixer
    214695 - with paq-like logistic mixer
    
     - Added a logistic mixer provided by toffer
     - Added a second instance of linear mixer (with same inputs)
       for more fair comparison (so that both mixers would have
       two state variables)
    
    ToDo:
     - Implement a linear mixer based on a counter 
       with bit frequencies
     - Optimize the stems[] table for paq_mixer
     - Clean paq_mixer and add more parameters (lim & mw in update)
    P.S. Already doing that stems[] optimization... 214339 atm,
    kinda interesting.

  21. #21
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Thanks for testing. I can't do linear mixing with zpaq without changing its basic architecture. But I did lots of experiments like this years ago on older versions of paq and also more recently with early versions of zpaq. As I wrote the spec I realized that making all predictions logistic was a more elegant solution (mixers and SSE are much simpler) and there was no reason to have linear predictions at all until time to arithmetic code.

Logistic mixing is equivalent to linear mixing if the models give similar outputs, like with SSE bypass mixing in paq8. (zpaq would still use logistic mixing for this). That looks like what the beta weighting did, since they were mixing DMC models with different parameters, rather than mixing different orders. The exception was the match model, but it seems possible to mix well because beta weighting adapts rapidly and the match model tends to stay on or off for long periods. I would be surprised if you could get good results with linear mixing of different orders (not BWT).

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I would be surprised if you could get good results with
    > linear mixing of different orders (not BWT).

    Are these results that bad: http://ctxmodel.net/files/MIX/mix_v6_SFC.htm ?
    Or what about Ash? (http://ctxmodel.net/files/ASH/)
    I only use linear mixing and can't say I'm dissatisfied with its performance.

    Btw here's the new stems table (the result didn't get any better than 214339)
    http://ctxmodel.net/files/mix_test/stems1.png (New values in red).
    And here's the corresponding y=stretch(x) plot:
    http://ctxmodel.net/files/mix_test/stems2.png
    (new values in red, (new-old)*4 in green).

  23. #23
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
When talking with Eugene about the PAQ mixer, I came across the derivation of the logistic mixer again. I never read it anywhere, so I decided to post it.

    The PAQ mixer

    Coding cost h of bit y:

    h = -log2(P(y)) = -1/ln2 * ln(P(y))

    Let p = P(y=1) be a probability estimation and a=1-y, b=2*y-1,
    thus we can rewrite - dropping the constant:

    -ln(a+b*p)

    A logistic mixer (PAQ) computes the prediction mix p as:

    p = Sq(x)
    x = sum j w_j*x_j
    x_j = St(p_j)

    where x_j is the stretched input probability p_j. St(x) = ln(x/(1-x))
    and Sq(x) = 1/(1+e^-x). Gradient descent can be used as an iterative
    solver for the coding cost minimization:

    w_j' = w_j + L*d_j
    d_j = -dh/dw_j

    Hence the gradient grad H = (d_0 d_1 ... ) is required:

    dh/dw_j =
    = - d/dw_j ln(a+b*p)
    = - b/(a+b*p) * dp/dw_j
    = -1/(a/b + p) * dp/dw_j

The partial derivative of p = Sq(x) (with x and Sq(x) as above) is

d/dw_j Sq(x) =
= dSq/dx * dx/dw_j
= Sq(x)*(1-Sq(x))*x_j
= p*(1-p)*x_j

    Thus dh/dw_j becomes

    dh/dw_j =
    = -1/(a/b + p) * p*(1-p)*x_j

    Evaluating a/b for y={0,1} yields:

    a/b + p, y=0 -> -1 + p = -(1-p)
    a/b + p, y=1 -> 0 + p = p
    a/b = -(1-y)

    Again, reinsertion into dh/dw_j gives us

    y=0: -1/(-1*(1-p)) * p*(1-p)*x_j = -(0-p)*x_j
    y=1: -1/p * p*(1-p)*x_j = -(1-p)*x_j

    Which can be simplified using e=y-p

    dh/dw_j = -e*x_j = -(y-p)*x_j

    Finally a weight update using gradient descent can be obtained via

    w_j' = w_j + L*(y-p)*x_j
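In code, the whole predict/update cycle under this notation might look like the sketch below (my own illustration with a constant learning rate L; real implementations use fixed-point arithmetic and usually a mixer context as well):
Code:
// Logistic mixer with the derived update w_j' = w_j + L*(y - p)*x_j,
// where x_j = St(p_j) and p = Sq(sum_j w_j*x_j).
#include <cmath>
#include <cstddef>
#include <vector>

struct LogisticMixer {
    std::vector<double> w;   // one weight per input model
    std::vector<double> x;   // stretched inputs, kept for the update
    double p;                // last mixed prediction P(y=1)
    double L;                // learning rate (assumed constant here)

    LogisticMixer(int n, double rate = 0.01) : w(n, 0.0), x(n, 0.0), p(0.5), L(rate) {}

    double Mix(const std::vector<double>& probs) {
        double dot = 0.0;
        for (std::size_t j = 0; j < w.size(); ++j) {
            x[j] = std::log(probs[j] / (1 - probs[j]));   // x_j = St(p_j)
            dot += w[j] * x[j];
        }
        p = 1.0 / (1.0 + std::exp(-dot));                 // p = Sq(x)
        return p;
    }

    void Update(int y) {                                  // y is the coded bit, 0 or 1
        for (std::size_t j = 0; j < w.size(); ++j)
            w[j] += L * (y - p) * x[j];                   // gradient descent on coding cost
    }
};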

  24. #24
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
...and there's something more we came across. Exponential decay is a strategy for modeling non-stationary processes, and can usually be expressed like:

v(k) = sum i=0..k c^(k-i)*w(i) = w(k) + c*w(k-1) + c^2*w(k-2) ...

    Iteratively it can be expressed to be (0<=c<=1):

    v(k) = c*v(k-1) + w(k)

    Something interesting happens to a gradient weight update:

    w(k) = w(k-1) + L*d(k)

    Which doesn't look like exponential smoothing at all (otherwise it'd look like
    w(k) = c*w(k-1) + L*d(k))

Now suppose the constant L is decomposed into (1-c)*M:

    w(k) = w(k-1) + (1-c)*M*d(k)
    = w(k-1) + (1-c)*M*d(k) - c*w(k-1) + c*w(k-1)
    = c*w(k-1) + w(k-1) + M*d(k) - c*w(k-1) - c*M*d(k)
    = c*w(k-1) + (1-c)*(w(k-1) + M*d(k))

    Applying this decomposition we see that the w(k) update can be interpreted to look like:

    w' = c*w_old + (1-c)*w_new

    Which is called exponential smoothing and related to exponential decay:

    w(k) =
    = c*w(k-1) + (1-c)*d(k)
    = c^2*w(k-2) + c*(1-c)*d(k-1) + (1-c)*d(k)
    = c^3*w(k-3) + c^2*(1-c)*d(k-2) + c*(1-c)*d(k-1) + (1-c)*d(k)
    ...
    = c * [c^(k-1)*w(0)] + (1-c) *[ c^(k-1)*d(0) + c^(k-2)*d(1) + ... c*d(k-1) + d(k) ]

Obviously w(k) is a linear combination of w(0) = w0, the initial weight aged k-1 times, and an exponentially weighted sum of the d(i).

The odd thing is that without applying such a mathematical trick L=(1-c)*M - which is valid and doesn't change the weight update formula at all - one cannot see that exponential smoothing is involved here:

    w(k) = w(0) + L*(d(1) + d(2) + ... d(k))
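A tiny numeric check (my own sketch with arbitrary constants and made-up d(k) values) that the plain summed update and the rewritten exponential-smoothing form really are the same sequence, i.e. the trick only changes how the update is written:
Code:
// Compare w(k) = w(k-1) + L*d(k) with w(k) = c*w(k-1) + (1-c)*(w(k-1) + M*d(k)),
// where L = (1-c)*M; both columns should print the same values.
#include <cstdio>

int main() {
    const double c = 0.9, M = 0.5, L = (1 - c) * M;  // arbitrary constants
    double wa = 0.0, wb = 0.0;                        // the two equivalent forms
    for (int k = 1; k <= 5; ++k) {
        double d = 0.1 * k;                           // some made-up weight change d(k)
        wa = wa + L * d;                              // plain gradient update
        wb = c * wb + (1 - c) * (wb + M * d);         // exponential-smoothing form
        printf("k=%d  wa=%f  wb=%f\n", k, wa, wb);
    }
    return 0;
}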

    Any comments on that?

  25. #25
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    'w' stands for weight.
    'c' stands for context?
    'v' stands for mixed output?
    'L' stands for?
    'd' stands for?
    'M' stands for Model?

    Yes I am a noob.

  26. #26
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    L, M, c are real scalar constants. I assumed L = (1-c)*M.

    w(k), v(k), d(k) are some time series, k is the current step.


Just to show a general form of exponentially weighted estimation:

    explicit formulation
(1) v(k) = sum i=0..k c^(k-i)*w(i) = w(k) + c*w(k-1) + c^2*w(k-2) ...

    implicit formulation
    (2) v(k) = c*v(k-1) + w(k)


    A gradient descent weight update (as with the logistic mixer):

    (3) w(k) = w(k-1) + L*d(k)

    here w(k) is a weight and L*d(k) a weight change.

    w(k) = w(k-1) + L*d(k)
    = w(k-1) + (1-c)*M*d(k) + ( 0 )
    = w(k-1) + (1-c)*M*d(k) + (- c*w(k-1) + c*w(k-1))
    = c*w(k-1) + w(k-1) + M*d(k) - c*w(k-1) - c*M*d(k)
    = c*[w(k-1)] + (1-c)*[w(k-1) + M*d(k)]

    (4) w(k) = c * w_old + (1-c) * w_new

    w_new = w(k-1) + M*d(k)
    w_old = w(k-1)

    (4) was derived from (3) using a trick. (4) and (3) are equivalent, but (3) doesn't have a form like exponential smoothing (2), whereas (4) has a form like:

    (5) w(k) = c*w(k-1) + (1-c)*x(k)

    x(k) = w(k-1) + M*d(k)

    (5) contains exponential smoothing like in (1):


w(k) = c*w(k-1) + (1-c)*x(k)
    = c^2*w(k-2) + c*(1-c)*x(k-1) + (1-c)*x(k)
    = c^3*w(k-3) + c^2*(1-c)*x(k-2) + c*(1-c)*x(k-1) + (1-c)*x(k)
    ...
    = c * [c^(k-1)*w(0)] + (1-c) *[ c^(k-1)*x(0) + c^(k-2)*x(1) + ... c*x(k-1) + x(k) ]

EDIT: Problem solved - compared apples to oranges. A sum of weight changes d(k) equals an exponentially decayed sum of the actual weights w(k)

  27. #27
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Thanks Toffer. I will study this carefully.

  28. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
Toffer, I think that is the same derivation I used when I designed the new mixer for paq7. What I found interesting was that the weight update was simpler than back-propagation. For back-propagation, the goal is to minimize the mean square output error rather than the coding cost. When you differentiate the squared error (y-p)^2 with respect to w_i, you get:

w_i := w_i + L*x_i*(y-p)*p*(1-p)

where w_i is the weight, L is the learning rate, x_i is the input, y is the desired output and p (0 < p < 1) is the actual output. When you differentiate the coding cost of bit y with probability p with respect to w_i, the factors p*(1-p) go away. I mentioned that in the paq7 comments but omitted the derivation.

  29. #29
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    MSE makes some sense but would be a wrong metric there.

Luckily the form of stretch coincides with the model output histogram (count the number of times bits are coded at each probability) - higher resolution near 0/1, where most predictions are made. And in this case the actual probability coincides with some kind of "confidence" as well. But in general a high probability is not related to confidence at all.

There are some plots of model histograms:
    http://encode.ru/forum/showthread.php?t=159

    That would be completely different for a bytewise model. The NN mixer wouldn't work as well in that case. In my eyes that is caused by the alphabet decomposition.

  30. #30
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
Is there a mixing strategy for when a variable number of models is involved?
I've tried to implement the Paq mixer in my compression algorithm, but I found out that the Paq mixer expects a constant number of models for its input.


