
Thread: Question about LPAQ1 mixer

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts

    Question about LPAQ1 mixer

    I'll paste the code I'm referring to:
    //////////////////////////// Mixer /////////////////////////////

    // Mixer m(N, M) combines models using M neural networks with
    // N inputs each using 4*M*N bytes of memory. It is used as follows:
    // m.update(y) trains the network where the expected output is the
    // last bit, y.
    // m.add(stretch(p)) inputs prediction from one of N models. The
    // prediction should be positive to predict a 1 bit, negative for 0,
    // nominally -2K to 2K.
    // m.set(cxt) selects cxt (0..M-1) as one of M neural networks to use.
    // m.p() returns the output prediction that the next bit is 1 as a
    // 12 bit number (0 to 4095). The normal sequence per prediction is:
    //
    // - m.add(x) called N times with input x=(-2047..2047)
    // - m.set(cxt) called once with cxt=(0..M-1)
    // - m.p() called once to predict the next bit, returns 0..4095
    // - m.update(y) called once for actual bit y=(0..1).

    inline void train(int *t, int *w, int n, int err) {
      for (int i=0; i<n; ++i)
        w[i]+=t[i]*err+0x8000>>16;
    }

    inline int dot_product(int *t, int *w, int n) {
      int sum=0;
      for (int i=0; i<n; ++i)
        sum+=t[i]*w[i];
      return sum>>8;
    }

    class Mixer {
      const int N, M; // max inputs, max contexts
      int* tx;        // N inputs
      int* wx;        // N*M weights
      int cxt;        // context
      int nx;         // Number of inputs in tx, 0 to N
      int pr;         // last result (scaled 12 bits)
    public:
      Mixer(int n, int m);

      // Adjust weights to minimize coding cost of last prediction
      void update(int y) {
        int err=((y<<12)-pr)*7;
        assert(err>=-32768 && err<32768);
        train(&tx[0], &wx[cxt*N], N, err);
        nx=0;
      }

      // Input x (call up to N times)
      void add(int x) {
        assert(nx<N);
        tx[nx++]=x;
      }

      // Set a context
      void set(int cx) {
        assert(cx>=0 && cx<M);
        cxt=cx;
      }

      // predict next bit
      int p() {
        return pr=squash(dot_product(&tx[0], &wx[cxt*N], N)>>8);
      }
    };

    Mixer::Mixer(int n, int m):
      N(n), M(m), tx(0), wx(0), cxt(0), nx(0), pr(2048) {
      assert(n>0 && N>0 && M>0);
      alloc(tx, N);
      alloc(wx, N*M);
    }
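
    For reference, here is a floating-point sketch of what the fixed-point code above computes, under my reading of the scaling (tx[] holds stretch(p) with an 8-bit fraction, wx[] has a 16-bit fraction, pr is a 12-bit probability; these scalings are my assumptions, not stated in the comments above). Written this way, update() is plain online gradient descent on coding cost with an implied learning rate of about 7/4096, i.e. roughly 0.0017:

    Code:
    // Floating-point sketch of the fixed-point mixer above (assumed scaling:
    // t[i] = 256*stretch(p_i), w[i] = 65536*W[i], pr = 4096*P).
    #include <cmath>
    #include <vector>

    double squash_real(double x) { return 1.0/(1.0+std::exp(-x)); }

    struct MixerFloat {
      std::vector<double> w;                  // real-valued weights
      explicit MixerFloat(int n) : w(n, 0.0) {}

      // analogue of p(): dot product of stretched inputs and weights, then squash
      double predict(const std::vector<double>& st) const {
        double sum=0;
        for (size_t i=0; i<w.size(); ++i) sum+=st[i]*w[i];
        return squash_real(sum);
      }
      // analogue of update(): w[i] += (t[i]*err+0x8000)>>16 becomes, in real numbers,
      // W[i] += eta*stretch(p_i)*(y-P) with eta = 7.0/4096 (from err=((y<<12)-pr)*7 and >>16)
      void update(const std::vector<double>& st, double P, int y) {
        const double eta=7.0/4096;
        for (size_t i=0; i<w.size(); ++i) w[i]+=eta*st[i]*(y-P);
      }
    };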


    Function p() does a weighted summation of stretched predictions and then squashes the result. My concern is about the weighted summation. AFAIU, the sum of the weights has to be more or less constant for the whole algorithm to work correctly, but that's not obvious from looking at the train() function. Do you have a simple explanation of the convergence? What are the possible errors, i.e. the distance of the weight sum from the ideal value (some fixed constant), caused by rounding or similar effects?


    A side question:
    What is most time consuming in LPAQ1? Is it retrieving data from the hash tables, or the mapping and mixing (i.e. the heavy computation) itself?

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Weights don't have to add to 1, although they usually are close. In ZPAQ I have two types of mixers. A MIX uses 20 bit weights (range -8 to 8 with a 16 bit fraction) and 12 bit stretched predictions (-8 to 8 with an 8 bit fraction, representing ln(p(1)/p(0))). It takes any number of inputs. A MIX2 takes 2 inputs and uses a single 16 bit weight, constrained to the range 0 to 1.

    Depending on the model, one or the other might work better. An ISSE mixes a prediction with a constant 1 using a pair of weights selected by a context mapped to a bit history. I used a 2 input MIX rather than a MIX2 because it compresses better.

    When I use a SSE (APM in PAQ) to adjust the final bit prediction, I use a MIX2 to mix its input and output. This works better than a 2 input MIX and better than PAQ's method of linear mixing with a constant weight (3/4 output, 1/4 input).
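
    For illustration, here is a minimal floating-point sketch of the MIX2 idea (a single weight w kept in [0,1], so the pair of effective weights w and 1-w always sums to 1). This is just the idea, not ZPAQ's fixed-point code, and the learning rate is an arbitrary placeholder:

    Code:
    // Minimal 2-input mixer with one weight constrained to [0,1].
    #include <algorithm>
    #include <cmath>

    struct Mix2 {
      double w=0.5;                              // weight of the first input
      double rate=0.002;                         // assumed learning rate

      double p(double st1, double st2) const {   // st = stretch(p) = ln(p/(1-p))
        double x=w*st1+(1-w)*st2;
        return 1.0/(1.0+std::exp(-x));           // squash
      }
      void update(double st1, double st2, double pr, int y) {
        w+=rate*(y-pr)*(st1-st2);                // gradient of coding cost w.r.t. w
        w=std::min(1.0, std::max(0.0, w));       // keep the weight pair summing to 1
      }
    };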

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Weights don't have to add to 1, although they usually are close.
    That's the tricky part for me.

    If they don't add to 1 but e.g. to 2, then if all models give a prediction of 0.5, the final prediction will be 1.0, which would be far away from the original predictions. Is there any proof that functions like train() are numerically stable and that the distance of the weight sum from 1 (i.e. the imprecision) is bounded?

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > That's the tricky part for me.

    In theory these weights are probabilities (of each submodel being the best one for coding the next bit),
    so they have to add up to 1.0.
    That's why I prefer to use chains of mixers with 2 inputs - they're a little more precise, have more parameters to tune,
    don't really increase the amount of computation, and there's no problem with weight normalization.

    Though in any case, the squash function maps anything into a valid probability interval, and weight-space overflow
    may even improve compression indirectly (based on the same effect as my "probability extrapolation").

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Weights only need to add to 1 for linear mixing.

    > That's why I prefer to use chains of mixers with 2 inputs

    A quick test.

    zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255c0,0,255,255,255m -> 837929 (order 1, 2, 3 ICM and 3 input order 0 MIX with unconstrained weights).

    zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255tc0,0,255,255,255t -> 851189 (1, 2, MIX2, 3, MIX2 with weights constrained to add to 1).

    zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255mc0,0,255,255,255t -> 842967 (replace first MIX2 with MIX, removing constraint).

    zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255mc0,0,255,255,255m -> 838360 (replace second MIX2 with MIX (but takes all components as input)).

    zpaq a x calgary -v 0 -m 4nc0,0,255i2,1 -> 842012 (ICM1, ISSE2, ISSE3 chain with 2 implied 2 input MIX with unconstrained weights).

    zpaq a x calgary -v 0 -m 4nc0,0,255i2,1m -> 832878 (as above with a final 3 input MIX).

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Weights only need to add to 1 for linear mixing.

    Not really - if we mix (p-0.5) values and clip the <=0 and >=1 overflows.

    > A quick test.

    Thanks, but I don't think that it tells us anything.
    My opinion is based on a specific theory about the nature of the weights in a logistic mixer,
    and that theory provides an explanation for results like yours, as I mentioned.
    To test it, we'd need
    (1) a mix() that would give the same results as mix2() if we normalized its weights (i.e. w'[i]=w[i]/wsum),
    (2) a mix2() with a tuned constant coefficient,
    (3) maybe to investigate the behavior of wsum and explicitly implement it for mix2, with parameters -
    the difference is likely similar to a freq pair vs a linear counter.

    In any case, I think that even if you have an algorithm which works better for unknown reasons,
    it doesn't mean that it would be better in any other framework too.

  7. #7
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I did research about something related. Here comes the math...


    1. PAQ derivation

    First of all, the PAQ mixture is the unique minimizer of

    f(q) = w_1 D(q||p_1) + ... + w_m D(q||p_m)

    where D(q||p) is the KL-divergence between PDFs q (source distribution) and p (model distribution); p_1, ..., p_m are the model distributions (i.e. predictions); and w_1, ..., w_m are non-negative weights where at least one is non-zero. All of these PDFs are over a binary alphabet.

    Finally, the minimizer q* of f(q) is given by

    q*(1)
    = [ prod i=1..m p_i(1)^w_i' ] / [ sum x in {0, 1} prod i=1..m p_i(x)^w_i' ]
    = squash( sum i=1..m w_i' * stretch(p_i(1)) )
    (probability of bit 1)

    where w_i' = w_i/(w_1 + ... + w_m). Thus the weights are *forced to sum to one* (and, by assumption, are non-negative) by the nature of the above problem.

    Minimizing f(q) has the following sense: Given model distributions p_1, ..., p_m and weights w_1, ..., w_m as a measure of the performance/importance of the model PDFs find the "closest" source distribution. Here we measure proximity via the KL-divergence (expected code length excess).
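
    A quick numeric check of that identity, with made-up predictions and weights (nothing compressor-specific, just the two sides of the formula above):

    Code:
    // Checks that prod p_i(1)^w'_i / sum_x prod p_i(x)^w'_i equals
    // squash( sum_i w'_i * stretch(p_i(1)) ) for normalized weights w'.
    #include <cmath>
    #include <cstdio>

    int main() {
      double p[3]={0.9, 0.3, 0.55};        // made-up model predictions for bit 1
      double w[3]={2.0, 1.0, 1.0};         // unnormalized non-negative weights
      double wsum=w[0]+w[1]+w[2];
      double num1=1, num0=1, s=0;
      for (int i=0; i<3; ++i) {
        double wn=w[i]/wsum;                       // w'_i
        num1*=std::pow(p[i], wn);
        num0*=std::pow(1-p[i], wn);
        s+=wn*std::log(p[i]/(1-p[i]));             // w'_i * stretch(p_i(1))
      }
      std::printf("product form: %.6f\n", num1/(num1+num0));
      std::printf("squash form:  %.6f\n", 1.0/(1.0+std::exp(-s)));
      return 0;
    }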

    More details: "Mixing Strategies in Data Compression", https://sites.google.com/site/toffer...ready-full.pdf


    2. Performance analysis

    Now suppose that for an input sequence x_1, ..., x_n with model distributions p_1,k, ..., p_m,k for k=1,...,n you want to find a "good" (in some sense) sequence of weight vectors w_1, w_2, ..., w_n using online gradient descent (OGD), s.t. the weight vector in step k may only depend on steps 1, ..., k-1. (This is what we usually do and want.) Let w* be the best static weight vector (i.e. w_1=w_2=...=w_n=w* minimizes the total code length).

    If you use a constant step size for OGD you have a uniform bound (for any input sequence and any sequence of model distributions):

    (1) code length of OGD scheme <= A * (code length of best static weight vector w*) + O( |w_1-w*| * A^2/(1-A) )

    The constant A depends on the step size. The smaller the step size, the closer A is to one. The problem with the above approach is that |w_1-w*| may be unbounded if you don't bound the weights. Now if you do bound the weights (details omitted, something like min(1, max(0, w_k)) suffices) the additive redundancy (the big-Oh term in (1)) is bounded. If you bound the weights to lie in some superset (like -1 <= w <= 1) of the probability simplex (weights non-negative, summing to one; see the projection sketch after this list) this has the following side effects:
    i. the worst-case redundancy increases (more freedom, more chance to do something wrong)
    ii. the best-case redundancy decreases (under some conditions it might even become negative!)
    iii. empirically, no dramatic changes
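
    For completeness, one standard way to force a weight vector back onto the probability simplex after an OGD step is the sort-based Euclidean projection. This is a generic sketch of that well-known algorithm, not something taken from the paper:

    Code:
    // Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1}.
    #include <algorithm>
    #include <vector>

    std::vector<double> project_to_simplex(std::vector<double> v) {
      std::vector<double> u=v;
      std::sort(u.begin(), u.end(), [](double a, double b){ return a>b; });
      double cumsum=0, theta=0;
      for (int j=0; j<(int)u.size(); ++j) {
        cumsum+=u[j];
        if (u[j]-(cumsum-1.0)/(j+1) > 0)      // the largest j+1 with this property wins
          theta=(cumsum-1.0)/(j+1);
      }
      for (double& x : v) x=std::max(x-theta, 0.0);
      return v;
    }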

    Now someone might say: (1) is a dumb bound. In practice we'll be better anyway, since we will adapt to local changes. This can be confirmed by experiments - and theoretically, since the following holds:

    (2) code length of OGD scheme <= A * sum_i=1..s [code length of the i-th input subsequence, starting at t_i and ending at t_(i+1)-1, under w_i*] + O( |w_i-w_i*| * A^2/(1-A) ).

    This means that the OGD scheme does not perform "much worse" than a sequence of s comparison vectors w_i*, where w_i* minimizes the code length of the i-th subsequence (the string x_(t_i), ..., x_(t_(i+1)-1) with model distributions p_1,k, ..., p_m,k for k = t_i, t_i+1, ..., t_(i+1)-1). Both the number s of subsequences and their boundaries 1=t_1<t_2<...<t_(s+1)=n+1 are arbitrary, hence you can take the minimum over these choices.

    More details: "Geometric and Linear Mixtures - Analysis", https://sites.google.com/site/toffer...ready-full.pdf

    The above is written down from memory. There might be small quirks like missing/wrong indices.

    I hope this helps.


    EDIT: the above assumes that in each time step k we have 0 < B <= p_1,k(x), ..., p_m,k(x) <= 1-B < 1, i.e. the PDFs have probabilities which are bounded from below by a constant B>0 and from above by 1-B.

    Cheers

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I think it makes sense for the model weights to be non-negative and add to 1 if the input models are well calibrated. But they might not be. Suppose you have a model that outputs p(1)=0.1 when the next bit is 1 and p(1)=0.9 when the next bit is 0. Then the ideal weight is negative. More practically, you might have a "weak" model that outputs p(1)=0.51 for a 1 or p(1)=0.49 for a 0. Then the ideal weight is much greater than 1. If you mix either of these models with a second model that outputs randomly or always outputs 0.5, then the ideal weight for the second model is 0, so the same holds for the weight sum.
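
    To put rough numbers on the "weak model" case: stretch(0.51) = ln(0.51/0.49) is only about 0.04, so turning that into, say, a 0.9 prediction takes a weight of roughly ln(9)/0.04, i.e. about 55 (just arithmetic to illustrate the point):

    Code:
    // A model saying 0.51/0.49 has the right sign but almost no magnitude after
    // stretching, so its ideal mixer weight is far above 1.
    #include <cmath>
    #include <cstdio>

    int main() {
      double st=std::log(0.51/0.49);        // stretch(0.51), about 0.04
      double target=std::log(0.9/0.1);      // stretch(0.9), about 2.2
      std::printf("weight needed to map 0.51 -> 0.9: %.1f\n", target/st);
      return 0;
    }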

    As another experiment, consider mixing a simple order 0 model and an order 1 model. Here we use direct, stationary context models on the Calgary corpus. The final "t" indicates a MIX2, whose weights are constrained to the range 0 to 1 and must sum to 1. "m" is a MIX, which has neither constraint.

    zpaq a x calgary -v 0 -m 4nc256c256,0,255t -> 1382194
    zpaq a x calgary -v 0 -m 4nc256c256,0,255m -> 1367527

    (4 means 2^4 MB blocks, n means no e8e9 filter, c256 means a CM with maximum count limit (256-1)*4 and no bytewise context, c256,0,255 means a CM likewise with no periodic or column context and a context byte mask of 255 (1 byte)).

    Here is the same experiment using indirect context models (c0 = ICM = context -> bit history -> prediction).

    zpaq a x calgary -v 0 -m 4nc0c0,0,255t -> 1277393
    zpaq a x calgary -v 0 -m 4nc0c0,0,255m -> 1271825

    Here is the same after a BWT (3 = BWT with 16 MB blocks).

    zpaq a x calgary -v 0 -m 3nc0c0,0,255t -> 804253
    zpaq a x calgary -v 0 -m 3nc0c0,0,255m -> 799110

    However, it is not always true that a MIX outperforms a MIX2. In a sequence ending like "mst" (MIX, SSE, MIX2), or "mm16t" (MIX with order 0 and 1 contexts, MIX2) a final MIX2 compresses better than a 2 input MIX. In these cases, the inputs to the MIX2 are presumably accurate. The higher level models therefore use by default "mm16tst" (m also takes all earlier components as input, but t takes only the last 2).

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    OK, that sheds some light. Thanks toffer & Matt.

    I think it makes sense for the model weights to be non-negative and add to 1 if the input models are well calibrated. But they might not be. Suppose you have a model that outputs p(1)=0.1 when the next bit is 1 and p(1)=0.9 when the next bit is 0. Then the ideal weight is negative. More practically, you might have a "weak" model that outputs p(1)=0.51 for a 1 or p(1)=0.49 for a 0. Then the ideal weight is much greater than 1. If you mix either of these models with a second model that outputs randomly or always outputs 0.5, then the ideal weight for the second model is 0, so the same holds for the weight sum.
    I would call those two models broken or at least not adapting at the right rate.

    Assuming that we have enough models so that at least one is predicting well in every situation, we can avoid negative or higher-than-1 weights. With enough models it should be sufficient to give the wrong model a weight of 0 and the well-behaving model a weight close to, but lower than, 1. "Weak" model predictions should rather be "strengthened" by some APM, I think. Overall, having weights outside the range (0,1) and not summing to 1 doesn't really correspond to the term "mixing" for me.

    Also, you have a 2-input restricted mixer and an N-input unrestricted mixer, but no N-input restricted mixer. I wonder what would happen if we restricted the N-input mixer. Maybe I'll try it.

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    In my experiment, all mixers have 2 inputs. Also, when I developed ZPAQ I initially experimented with weights restricted to the range 0 to 1 like in PAQ (but not required to add to 1). I found that the compression was not as good as when I allowed the weights to range from -8 to 8. I would have preferred 0 to 1 because you need 16 bits of precision for good compression (in my experiments), which would have allowed a faster SSE2 implementation of the MIX using 16 bit saturated arithmetic. Unfortunately I had to use 20 bit weights to get good compression. I implemented both the predict and update functions in SSE2, but found that the update function was faster in ordinary x86 or x86-64.

    It's not that you need 16 bit fractional parts to compute the prediction, but you do to make the update steps small enough. Using a 12 bit fraction could probably be done using randomized updates I guess. I suspect it would be slower. But 20 bits still allows for SSE2 prediction like this:

    weight (20 bits) x input (12 bits) -> 32 bits
    drop low 8 bits
    sum up to 256 components -> 32 bits.
    drop 20 bits.

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    So if I understand toffer's paper, it is necessary to put bounds on the weights to prove bounds on coding cost.

  12. #12
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    So if I understand toffer's paper, it is necessary to put bounds on the weights to prove bounds on coding cost.
    In the paper I only consider the case where the weight space is a compact (= closed and bounded) and convex subset of n-dimensional Euclidean space. This implies that (Euclidean) distances |u-v| between weights u and v are bounded.
    (There is some constant C s.t. for arbitrary pairs u, v of weights we have |u-v|<=C.) I use this constraint for several reasons:
    i. It is realistic and practical. In every implementation the weight space has to be bounded (finite memory/register size).
    ii. Redundancy terms such as O(|w_1-w*|) from my previous post (or Prop. 3.3 from the paper) are O(1). Hence the redundancy is bounded and cannot be arbitrarily large.

    On the other hand, the same proofs apply to the unbounded case. (If there is demand I can give some details.) However, ii. isn't satisfied - which I consider a problem.

  13. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    So the paper says that GEO (PAQ7) always does at least as well as LIN (PAQ6)?

  14. #14
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    So the paper says that GEO (PAQ7) always does at least as well as LIN (PAQ6)?
    Almost. The worst-case bounds (Th. 4.7) for GEO are always better than for LIN. To say that the actual worst-case code length is always lower, one needs to find an example input I=(x_1, ..., x_n; p_1,k, ..., p_m,k for k=1,...,n) where:
    code length bound GEO <= actual code length LIN for I <= code length bound LIN.
    I haven't found such an input yet.

    AFAIR LIN isn't PAQ6:
    I defined the linear mixture PDF p of m PDFs p_1, ..., p_m as: p(x) = w_1 * p_1(x) + ... + w_m * p_m(x) (the mixed probability for letter x), where w_1, ..., w_m >= 0 and w_1 + ... + w_m = 1.
    I call this linear mixing because the output PDF is a linear combination of the input PDFs.
    AFAIR in PAQ6 you use something different: a = a_1 * w_1 + ... + a_m * w_m (mixed count for bit 1), b = b_1 * w_1 + ... + b_m * w_m (mixed count for bit 0) and p(1) = a/(a+b).
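
    A small made-up example of the difference (equal weights; the confident model dominates the PAQ6-style count mixture but not LIN):

    Code:
    // LIN mixes probabilities; PAQ6 mixes bit counts and then normalizes.
    #include <cstdio>

    int main() {
      // model 1: 90 ones, 10 zeros (confident); model 2: 1 one, 1 zero (weak)
      double a[2]={90, 1}, b[2]={10, 1};
      double w[2]={0.5, 0.5};                      // equal weights, summing to 1

      double lin=0;                                // LIN: p(1) = sum w_i*p_i(1)
      for (int i=0; i<2; ++i) lin+=w[i]*a[i]/(a[i]+b[i]);

      double am=0, bm=0;                           // PAQ6: mix the counts
      for (int i=0; i<2; ++i) { am+=w[i]*a[i]; bm+=w[i]*b[i]; }

      std::printf("LIN  p(1) = %.3f\n", lin);          // 0.700
      std::printf("PAQ6 p(1) = %.3f\n", am/(am+bm));   // about 0.892
      return 0;
    }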

    Comparing LIN and GEO where the weights are found with OGD, the key messages are: for both LIN and GEO,
    i. we can provide code length guarantees (worst-case) for OGD compared to the best static weight vector (Prop. 3.3);
    ii. we can provide code length guarantees (worst-case) for OGD compared to a sequence of optimal weight vectors (Th. 3.7);
    iii. we can provide code length guarantees (worst-case) for OGD compared to a sequence of model distributions (Th. 4.7).

    iii. is actually a weaker refinement of ii.; both ii. and iii. guarantee that we will locally prefer models (predicted PDFs) which perform better than other models, if the change in compression is "dramatic" enough.
    Interestingly, the redundancy terms of GEO in Table 1 are O(B^2) vs. O(4^B/B) for LIN, where the smallest probability of any letter in the input PDFs is 2^-B. This indicates (though it is no proof) that the worst-case code length of GEO over a large range of input probabilities is exponentially lower than for LIN.

    EDIT: Here I always assume that the weights for both LIN and GEO are drawn from the probability simplex (non-negative, summing to one).

  15. #15
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Yes, you're right that PAQ6 is not LIN. It is a weighted sum of bit counts, which has the effect of giving more weight to predictions near 0 or 1 (like GEO).

  16. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I wonder how GEO performs on (near-) uniform distributions. A probability (close to) 0.5 after stretching is (close to) 0.

    Suppose that at some point we have a stream segment with uniform distributions. We also have two models:
    - model A, which was predicting poorly, but now is predicting well on the flat distribution,
    - model B, which was predicting well, but now is predicting worse than A.

    In the train() function the weight delta depends on the stretched probability, so the weight for model A won't improve (bad?), but the weight for model B will decrease (good). Shouldn't the weight for model A increase?

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    If model A is predicting well, then it will output a positive number (a) when the next bit is 1 and negative when 0. If model B is predicting poorly then it will output a number b independent of what bit is next. The prediction is p = squash(wa*a + wb*b) where squash(x) = 1/(1+exp(-x)). Then the prediction error is -p if the next bit is 0 or 1-p if the next bit is 1. Then the weights are updated: wa += a*error*rate, wb += b*error*rate, where the rate is typically around 0.001 to 0.01.

    The update of wa and wb depends on the prediction error, which depends on both a and b. Therefore a large error caused by b also increases the size of the updates to wa.

    What happens to wb depends on how b behaves. If b is always 0 (prediction = 1/2), then wb is not updated. But this does not matter because b has no effect on p. If b is random or fixed at some nonzero value, then because squash() is nonlinear, the effect on p will be larger when b is wrong than right and wb will decrease.

    Note that if we only have a single model with perfect prediction (like a = +1 or -1 depending on the next bit and b = 0), then the weight wa will grow without bound (as wa ~ ln(n*rate) after n predictions).

  18. #18
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Here is what happens to the weight of a single perfect predictor with update rates of 0.1, 0.01, and 0.001 after n = 1..2^30 predictions. The numbers in parentheses are the difference from ln(n*rate + 1).

    Code:
    #include <stdio.h>
    #include <math.h>
    int main() {
      double w1=0, w2=0, w3=0;
      for (int i=1; i<1100000000; ++i) {
        // perfect predictor: error = 1-squash(w) = 1/(1+exp(w)), so w += rate*error
        w1+=0.1/(1+exp(w1));
        w2+=0.01/(1+exp(w2));
        w3+=0.001/(1+exp(w3));
        if (!(i&i-1))  // print at powers of 2
          printf("%10d %10.6f %10.6f %10.6f (%10.6f %10.6f %10.6f)\n",
            i, w1, w2, w3, w1-log(i*0.1+1), w2-log(i*0.01+1), w3-log(i*0.001+1));
      }
      return 0;
    }
    
    
             1   0.050000   0.005000   0.000500 ( -0.045310  -0.004950  -0.000500)
             2   0.098750   0.009988   0.001000 ( -0.083571  -0.009815  -0.000998)
             4   0.192633   0.019925   0.001999 ( -0.143839  -0.019296  -0.001993)
             8   0.366860   0.039652   0.003997 ( -0.220926  -0.037309  -0.003972)
            16   0.668414   0.078518   0.007985 ( -0.287098  -0.069902  -0.007888)
            32   1.132554   0.153958   0.015938 ( -0.302530  -0.123674  -0.015561)
            64   1.743509   0.296139   0.031749 ( -0.257971  -0.198558  -0.030286)
           128   2.438299   0.549381   0.062995 ( -0.186369  -0.274795  -0.057451)
           256   3.159808   0.957576   0.124008 ( -0.121104  -0.312185  -0.103925)
           512   3.881147   1.525894   0.240359 ( -0.073936  -0.285668  -0.173075)
          1024   4.595130   2.202298   0.452262 ( -0.043475  -0.217181  -0.252814)
          2048   5.301957   2.921540   0.807021 ( -0.024948  -0.145582  -0.307465)
          4096   6.003548   3.646199   1.326941 ( -0.014072  -0.090518  -0.301515)
          8192   6.701714   4.364046   1.976350 ( -0.007834  -0.053831  -0.241983)
         16384   7.397769   5.073849   2.687668 ( -0.004317  -0.031127  -0.167882)
         32768   8.092569   5.777429   3.413006 ( -0.002359  -0.017655  -0.106507)
         65536   8.786643   6.476837   4.133631 ( -0.001280  -0.009872  -0.064112)
        131072   9.480303   7.173635   4.845982 ( -0.000690  -0.005459  -0.037366)
        262144  10.173732   7.868869   5.551389 ( -0.000370  -0.002991  -0.021313)
        524288  10.867033   8.563190   6.251979 ( -0.000198  -0.001627  -0.011968)
       1048576  11.560263   9.256990   6.949501 ( -0.000105  -0.000879  -0.006640)
       2097152  12.253455   9.950496   7.645163 ( -0.000056  -0.000472  -0.003649)
       4194304  12.946626  10.643839   8.339732 ( -0.000029  -0.000253  -0.001989)
       8388608  13.639786  11.337092   9.033672 ( -0.000015  -0.000135  -0.001077)
      16777216  14.332940  12.030297   9.727257 ( -0.000008  -0.000071  -0.000580)
      33554432  15.026090  12.723475  10.420644 ( -0.000004  -0.000038  -0.000310)
      67108864  15.719240  13.416638  11.113921 ( -0.000002  -0.000020  -0.000166)
     134217728  16.412388  14.109794  11.807138 ( -0.000001  -0.000010  -0.000088)
     268435456  17.105535  14.802946  12.500323 ( -0.000001  -0.000005  -0.000047)
     536870912  17.798683  15.496095  13.193490 ( -0.000000  -0.000003  -0.000025)
    1073741824  18.491830  16.189244  13.886648 ( -0.000000  -0.000002  -0.000013)

  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Here is a simulation of a perfect predictor which outputs +1 or -1 and a 1 input mixer with a learning rate of R=0.01. The first column is i, the number of compressed bits. The second column is the mixer weight, w. The third column is the prediction error, p. The fourth column is the total compressed size so far in bits. The last column is the compressed size of the last i/2 bits, which converges to 1/R.

    Code:
    #include <stdio.h>
    #include <math.h>
    int main() {
      double w=0, R=0.01, h=0, h1=0;
      for (int i=1; i<1100000000; ++i) {
        double p=1/(1+exp(w));   // prediction error = probability assigned to the wrong bit
        h-=log(1-p);             // accumulate code length in nats
        w+=p*R;                  // mixer update for a perfect +1/-1 input
        if (!(i&i-1)) {          // print at powers of 2
          printf("%10d %10.6f %10.6f %12.6f %10.6f\n",
                 i, w, p, h/log(2), (h-h1)/log(2));
          h1=h;
        }
      }
      return 0;
    }
    
          bits     weight      error         size  last half
             1   0.005000   0.500000     1.000000   1.000000
             2   0.009988   0.498750     1.996398   0.996398
             4   0.019925   0.496259     3.978458   1.982061
             8   0.039652   0.491316     7.900139   3.921680
            16   0.078518   0.481583    15.577643   7.677504
            32   0.153958   0.462737    30.299327  14.721684
            64   0.296139   0.427548    57.432137  27.132810
           128   0.549381   0.366860   103.975880  46.543742
           256   0.957576   0.277921   175.027346  71.051467
           512   1.525894   0.178858   267.986515  92.959169
          1024   2.202298   0.099634   372.896357 104.909841
          2048   2.921540   0.051124   480.644961 107.748604
          4096   3.646199   0.025433   587.167845 106.522884
          8192   4.364046   0.012568   691.689107 104.521262
         16384   5.073849   0.006220   794.557170 102.868062
         32768   5.777429   0.003087   896.290153 101.732984
         65536   6.476837   0.001536   997.305654 101.015500
        131072   7.173635   0.000766  1097.888087 100.582433
        262144   7.868869   0.000382  1198.216792 100.328705
        524288   8.563190   0.000191  1298.399958 100.183166
       1048576   9.256990   0.000095  1398.500975 100.101017
       2097152   9.950496   0.000048  1498.556211 100.055236
       4194304  10.643839   0.000024  1598.586196 100.029985
       8388608  11.337092   0.000012  1698.602372 100.016177
      16777216  12.030297   0.000006  1798.611053 100.008681
      33554432  12.723475   0.000003  1898.615690 100.004637
      67108864  13.416638   0.000001  1998.618157 100.002467
     134217728  14.109794   0.000001  2098.619464 100.001307
     268435456  14.802946   0.000000  2198.620155 100.000691
     536870912  15.496095   0.000000  2298.620519 100.000364
    1073741824  16.189244   0.000000  2398.620710 100.000191

  20. #20
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Thanks for the analysis. But I've been worried about the case where the input is random, and by "predicting well" I meant outputting a prediction p near or equal to 1/2.

    But after some thinking I realized that t[i], which is the stretched prediction, will be very small for the well-behaving model (i.e. the one that outputs predictions near 1/2, which after stretching are near 0), and thus its weight will diminish much more slowly than the weight of a model that is predicting worse (i.e. randomly and not in correlation with the random input).

    Logistic mixing starts to make much more sense to me now; also, intuitive restrictions like forcing the weights to sum to one now look unnecessary.

  21. #21
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Now I have a somewhat more difficult question.

    As you know I'm working on Demixer and I'm using a suffix tree (of limited depth) to retrieve statistics. Thus I can have a lot of statistics (currently only trimmed bit histories) at once, but their number is variable. Because their number is variable I cannot use a single N-input mixer with the restriction that the weights sum to 1. So I'm left with either a chain of mixers or an unrestricted mixer.

    If I choose a chain of 2-input mixers (restricted or not), then for a single place in the chain I can have a set of mixers to choose from (selected by a context). I don't know if it would make sense for an N-input mixer, i.e. a mixer which has a separate input for each order, but where instead of a single vector of weights I would choose each weight based on a small context which is independent of the contexts of the weights for the other orders. AFAIU the inputs of an unrestricted mixer are fairly independent, so it shouldn't matter if I have a set of weights per input, provided that I don't use a single weight variable for different inputs. Also it shouldn't really matter if I sometimes disable (or don't have information about) some inputs. Am I right?

    Also, I have the impression that with a chain of 2-input mixers the error propagation would be rather slow and/or less smooth if the chain is long, compared to a single N-input mixer.

  22. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I haven't tried selecting weights for an n-input mixer by different contexts, but it would probably work. For a chain of 2 input mixers, that is the normal thing to do. If you have an n-input mixer and a variable number of models, you could select a mixer according to the number of inputs or else set unused models to 0. I guess you need to do the experiment.
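
    A rough sketch of what that could look like: an N-input logistic mixer where each input selects its own weight by its own small context, and a missing model is simply fed 0 (the stretch of 0.5), which also freezes its weight update because the update is proportional to the input. Hypothetical code, not from any existing compressor:

    Code:
    #include <cmath>
    #include <vector>

    struct PerInputCtxMixer {
      int n, csize;
      std::vector<double> w;              // n*csize weights
      std::vector<int> ctx;               // current context per input
      std::vector<double> st;             // current stretched inputs
      double rate=0.002;                  // assumed learning rate

      PerInputCtxMixer(int n_, int csize_)
        : n(n_), csize(csize_), w(n_*csize_, 0.0), ctx(n_, 0), st(n_, 0.0) {}

      // set input i: its stretched prediction and its private context;
      // for a missing model call with s=0
      void set(int i, double s, int c) { st[i]=s; ctx[i]=c; }

      double predict() const {
        double sum=0;
        for (int i=0; i<n; ++i) sum+=st[i]*w[i*csize+ctx[i]];
        return 1.0/(1.0+std::exp(-sum));
      }
      void update(double pr, int y) {
        for (int i=0; i<n; ++i)
          w[i*csize+ctx[i]]+=rate*(y-pr)*st[i];   // zero if st[i]==0
      }
    };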

  23. #23
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    From a theoretical POV, having a chain of two-input mixers, one for each binary context, is very appealing. I played around a bit with such a model (for each context: a two-input mixer with OGD and probability estimation via linear counters) theoretically, and it turns out that one can obtain superior theoretical guarantees - similar to CTW. (BTW the idea behind PPM is not much different from that.) For the actual value of a weight and a probability estimate for the first k occurrences of any context (k constant) one can use arbitrary heuristics (SSE, information inheritance, etc.) and the theoretical guarantees still hold. Note that I didn't work out all the math, but everything I found up until now points in the same direction. This could be a happy union of theory and practice. I will implement such a scheme on my own, but I'd be happy if someone else tries. BTW Shelwien used such a model setup, too (e.g. the MIX* compressors on his site).

    If you want some hints on the setup for PDF estimation and mixer updates for each context let me know.

    Cheers

  24. #24
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Thanks for the information. But according to my quick experiment described in the thread about Demixer, long chains of mixers are unstable, at least if there's no additional context to select mixers at each step. I'm thinking about computing something like the Levenshtein distance between bit histories of consecutive orders and, based on that, selecting or disabling mixers at each place in the chain. I'll think about fine-tuning the learning rates after I stabilize the predictions in long chains.

  25. #25
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Thanks for the information. But according to my quick experiment described in the thread about Demixer, long chains of mixers are unstable, at least if there's no additional context to select mixers at each step. I'm thinking about computing something like the Levenshtein distance between bit histories of consecutive orders and, based on that, selecting or disabling mixers at each place in the chain. I'll think about fine-tuning the learning rates after I stabilize the predictions in long chains.
    Just to clarify: you have one PDF estimator per context. For the scheme I mentioned you need one PDF estimator and one mixer per context. And you need to mix "from top to bottom" (first mix the two highest-order predictions). E.g. for mixing just orders 0 and 1 you have (at most) 255 + 256*255 PDF estimators and mixers.

    And I don't think a proper implementation performs poorly. Compressors like PPMonstr use similar techniques - although they don't decompose the source alphabet - and perform pretty well.
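
    My rough reading of the scheme as code: each active context node owns a PDF estimate and a 2-input mixer, and mixing proceeds from the highest order down to order 0. Which node owns which mixer, and the update rule, are my assumptions, so take this as a sketch only:

    Code:
    #include <cmath>
    #include <vector>

    struct Node {                       // one per binary context
      double p1=0.5;                    // PDF estimate: probability of a 1 bit
      double w=0.5;                     // mixer weight for the higher-order input
    };

    double stretch(double p) { return std::log(p/(1-p)); }
    double squash(double x)  { return 1.0/(1.0+std::exp(-x)); }

    // nodes[k] is the active context node of order k, k = 0..K (highest).
    double predict_chain(const std::vector<Node>& nodes) {
      double q=nodes.back().p1;                    // start with the highest order
      for (int k=(int)nodes.size()-2; k>=0; --k) {
        const Node& nd=nodes[k];
        q=squash(nd.w*stretch(q) + (1-nd.w)*stretch(nd.p1));
      }
      return q;                                    // final order-0 mix
    }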

    EDIT: If you are interested I can provide a simple framework for parameter tuning. I suggest having a look at genetic algorithms if you want to do that on your own.

  26. #26
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Just to clarify: you have one PDF estimator per context. For the scheme I mentioned you need one PDF estimator and one mixer per context. And you need to mix "from top to bottom" (first mix the two highest-order predictions). E.g. for mixing just orders 0 and 1 you have (at most) 255 + 256*255 PDF estimators and mixers.
    I don't get it - what's a PDF estimator here? How is it different from the mixer itself? And you mean that I need a separate mixer for each context and not for each context order?

    Is there any reason that mixing "from top to bottom" should give better results than mixing in reverse order, given proper tuning of course?

    EDIT: If you are interested i can provide a simple framework for parameter tuning.
    Well, if you have a ready one then I'll happily look at it.

    I suggest to have a look at genetic algorithms if you want to do that on your own.
    Sure, but I want to play with various transforms, structures and schemes first.

  27. #27
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I don't get it - what's a PDF estimator here?
    OK, we consider the binary case. For every binary context you model the probability of a one bit and a zero bit, in other words a probability density function (PDF) over a binary alphabet. This is a PDF estimator.

    How is it different from the mixer itself?
    See above.

    And you mean that I need a separate mixer for each context and not for each context order?
    Yes. I mean a separate mixer for every context - and not just each order.

    Is there any reason that mixing "from top to bottom" should give better results than mixing in reverse order, given proper tuning of course?
    Theoretical reasons. ATM my math here is rather sketchy and consists of pencil-and-paper work only. I need to recheck this. But it looks plausible to me.

    Well, if you have a ready one then I'll happily look at it.
    I've recently rewritten and simplified my optimization framework. To optimize custom targets you need to:
    1. provide a parameter definition (see example/definition)
    2. provide a C function which returns the cost function value (see example/rastrigin.cpp)

    I use Code::Blocks. To compile by hand with gcc:
    g++ -o opt.exe -Iexample optimize\optimize.cpp example\rastrigin.cpp
    (This builds the optimizer to minimize the Rastrigin function with parameter 2, i.e. in two dimensional space)
    Then run opt.exe to see the options. Run e.g. "opt alg=dcga,pop=50,iter=100,miss=100,pm=0.06,log=log.txt,dst=pop.txt" to optimize for 100 iterations with the specified GA options, logging to log.txt and recording parameters to pop.txt (one can resume optimization from such record files).

    To optimize your own stuff do:
    1. create directory "mystuff"
    2. create parameter profile "mystuff/definition"
    3. create cost function with driver program: "mystuff/mycost.cpp"
    4. g++ -o opt.exe -Iexample optimize\optimize.cpp mystuff/mycost.cpp

    Note that the provided files are experimental and not well-tested, i.e. there might be bugs. At least it passed my tests. Some functionality like multi-threading is missing here.

  28. #28
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    The optimizer program looks somewhat convoluted. AFAIU it's meant for direct integration with C++ sources, yes? Then I'll need to spawn a new process for each optimization trial and somehow pass the parameters.

    Yes. I mean a separate mixer for every context - and not just each order.
    Ouch, that seems like overkill. And it would be incompatible with a suffix tree with compressed edges, because with CTW-like weighting there would be multiple different mixer weights even on non-branching edges, so the tree could no longer have linear size. With bit histories it's a different story, because bit histories are identical for every node on a compressed edge and the number of branches in a suffix tree is linear.

    I've taken a look at CTW results and they aren't overly impressive. Shelwien did a bytewise mixing compressor http://encode.ru/threads/541-Simple-...xt-mixing-demo but that also didn't score well (worse than CTW) on book1. One would expect bytewise coders to do very well on byte-oriented data like book1.

  29. #29
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    The optimizer is perfectly separable - however I prefer it this way (integrated with the C++ sources), since I don't need to write parsers for loading parameter configurations - the code is generated automatically.
    You don't need to spawn a process per trial. And you don't need to take care of where the parameters come from. You just need to supply a function:

    double F( const char* x ) {
      PROBLEM_NAME p=x;
      // compute cost function value y
      return y;
    }

    and declare the contents of PROBLEM_NAME via the definition file.

    As to CTW...

    Well, CTW implementations are dated and don't use any modern heuristics.

    As I said, PPMonstr uses some modern heuristics and works similarly. And its performance is pretty good.

  30. #30
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    I've tried mixer chains but they were not better than the traditional approach.

    One should not forget that the current math for PDF estimation assumes memoryless subsources, which conflicts with our use of state-based counters.
    I have a PDF written for a university course about limited-order Markov compression; it is in German, though.

    I've also written a large optimizer framework which can work with any program that is controllable through a text-based configuration file.
    You can select between different algorithms like DDS, DE and ASA. DDS gives fairly good estimates with around 1000 function evaluations for dimensions > 30.
