
Thread: TarsaLZP

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts

    TarsaLZP

I'm starting a thread about TarsaLZP again. The previous thread, http://encode.ru/threads/1449-Adaptive-adaptation-rate-), doesn't have an appropriate name anymore.

I've experimented with cost estimation and I've opted not to estimate the weighted cost, but the weighted difference from the minimal cost. With that change the estimator makes better use of o1 models. I've also tried making an SSE context out of, e.g., cost order history or comparisons of interval length estimations, but none of those were good.

I've experimented with state tables and found a good generating formula. Attached is an old Java port of TarsaLZP (the one with the crappy o2 coder), but with a newer FSM and increased memory usage.

    The results of that version are:
    enwik8 - 24751389 bytes - 24314 ms
    enwik9 - 208867187 bytes - 202918 ms

    Machine is:
    Core 2 Duo E8400, 8 GiB RAM, Ubuntu 11.10 64-bit, OpenJDK 7 from repository.

    PS:
If you're unable to run jTarsaLZP due to OutOfMemoryError (weirdly, it sometimes requires a very big heap to run properly), try shrinking the order-8 LZP model. Instead of
    Code:
    final int lzpLevel8 = 30;
try lower values. One slot takes 2 bytes, and there are 2^lzpLevel8 slots in the order-8 LZP model.
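For the default lzpLevel8 = 30 that works out to 2 GiB for the order-8 model alone; a quick sketch of the arithmetic (the class name is made up):

```java
public class LzpMemory {
    public static void main(String[] args) {
        final int lzpLevel8 = 30;
        // Each slot is 2 bytes and there are 2^lzpLevel8 slots,
        // so the model needs 2 * 2^lzpLevel8 bytes.
        long bytes = 2L << lzpLevel8;   // 2 * 2^30 bytes = 2 GiB
        System.out.println(bytes + " bytes = " + (bytes >> 20) + " MiB");
    }
}
```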
Attached Files
    Last edited by Piotr Tarsa; 29th January 2012 at 21:02.

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
I posted your results. I didn't have exact memory requirements or decompression time. I tried it on 32-bit Windows but got an out-of-memory exception. I didn't try modifying the source. http://mattmahoney.net/dc/text.html#2088

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
Thanks. Java has some weird requirements when it comes to memory allocation. I got an OoME even with a 3 GiB heap, although I had allocated only slightly above 2 GiB.

I'm working on a C version now (this past weekend I worked on FSMs), so in a few days there should probably be a new version (based on my recent PPM coder) with a working LZP layer.

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts

    Mixing LZP predictions

    I have some philosophical problems.

Recall that past TarsaLZP versions maintained two completely separate LZP models, one of order 4 and one of order 8. I simply switched between them, depending on which one gave the higher probability of a match, where match length is always 1. That was rather simple.

    What I want to do now is to use both predictions if both are available and different. The scheme is as follows:
1. if either LZP model has a prediction at the current position, encode a flag telling whether the byte was matched by either model,
2. if the byte wasn't predicted correctly by either model, it's encoded by the lower layers - currently a PPM layer with full symbol exclusion,
3. if both LZP models had a prediction and the predictions were different, an additional flag is needed to distinguish between them.
Each model predicts at most one value at a time (well, that's what LZP is about).
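The three steps above can be sketched as a small decision function (illustrative only; the real coder would of course encode the flags rather than count them, and the class and method names are made up):

```java
final class LzpFlags {
    /** Returns how many flag bits would be coded for this symbol. */
    static int encode(Integer pred4, Integer pred8, int symbol) {
        if (pred4 == null && pred8 == null) return 0;   // no prediction, no flag
        boolean hit = (pred4 != null && pred4 == symbol)
                   || (pred8 != null && pred8 == symbol);
        int flags = 1;                                  // step 1: the match flag
        // step 3: a distinction flag is needed only when both models
        // predicted, the predictions differed, and one of them matched
        if (hit && pred4 != null && pred8 != null && !pred4.equals(pred8)) flags++;
        return flags;                                   // on a miss, step 2: PPM codes the byte
    }
}
```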

Consider step 1, when both models have predictions.

    Situation 1a:
The order-4 model says that 'a' has probability 0.7. The order-8 model says that 'b' has probability 0.5. Taken together, the probability that neither model is correct would come out negative, which is impossible. How do we mix those probabilities? If the models predict different values, then it seems obvious that the probability that either one is correct is higher than the higher of the two predicted probabilities, i.e. in this case the more probable symbol is 'a' with probability 0.7, so the probability that the actual symbol is either 'a' or 'b' must be higher than 0.7.

    Situation 1b:
Both models predict the same value. Say the order-4 model says that 'a' has probability 0.7 and the order-8 model says that 'a' has probability 0.5. I suspect that a simple mixer with fpaq0p-like updates (i.e. by a fraction of the error) would be sufficient, but maybe not.
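An fpaq0p-like update for a single match-flag probability might look like this (a hedged sketch in 12-bit fixed point; the class name, precision and the 1/32 adaptation rate are my assumptions, not taken from TarsaLZP):

```java
final class MatchModel {
    private int p = 2048;               // P(match), fixed point in [0, 4096)

    int probability() { return p; }

    // fpaq0p-style update: move p a fraction (1/32) of the way
    // toward the observed outcome.
    void update(boolean match) {
        if (match) p += (4096 - p) >> 5;
        else       p -= p >> 5;
    }
}
```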


If the predictions are different, then we have to encode an additional flag to distinguish them.

    Situation 2:
The order-4 model predicts that it will win with probability 0.99. The order-8 model predicts that it will win with probability 0.9. What's the resulting (mixed) probability?


Additionally, I would rather make an adaptive mixer, not a fixed one, as it seems probable that some regions of a file would benefit more from the order-4 model and some from the order-8 model. And I want it to be reasonably fast; I wouldn't want the new TarsaLZP to be significantly (e.g. over 2x) slower than the older dual-LZP-model versions.

  5. #5
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
Maybe it would be easier to first study the situation where both predictions reinforce each other, e.g. when they predict the same winner.
There may be some easy gains just from exploiting this situation.

  6. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
Well, I would want to have a clear picture of all situations.


Let's consider a few more examples (probability of the first flag == probability that the incoming symbol is matched by either LZP model):


3. A memoryless generator which generates bytes, with p['0'] = 1/3 - epsilon, p['1'] = 2/3, p[other] = epsilon/254. With such a source, the expected behaviour is surprising. When the predicted symbols are different, then no matter what the predictions from the LZP models are, the probability that either model is matching should be very close to 1 (and so should the probability of the first flag). Then comes the distinction flag - as LZP models don't care about actual symbol values (they model only matches, not values), it would be difficult or impossible to compute the correct probability ratio between them. Also, if both LZP models give the same prediction, then we should encode the first flag with either 1/3 or 2/3 probability, depending on the value the LZP models predict - but as I said, in general LZP models don't care about actual values, only match histories.

4. A memoryless generator with an alphabet of size 3 and all symbols equally probable. Then the LZP models will always predict the correct value with probability 1/3, so when the predictions from the two models are different, the first flag should have probability 2/3, and when they are the same, the first flag should have probability 1/3. As for the distinction flag - it should always give equal probabilities to both models.


I will have more time to think about it on the weekend, but it still looks somewhat hopeless to me.

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Probability theory doesn't tell you how to combine predictions. I would use a table to input the 2 predictions and whether or not they differ. The table output would be a new probability distribution (A, B, other). Then after coding, update the distribution to reduce the prediction error.

You will probably have to experiment with table size and quantization, and whether or not to interpolate the output between bins. For SSE, I found that stretching the probabilities (p -> ln(p/(1-p))) and quantizing between -8 and 8 in steps of 1/2 works with interpolation. Without interpolation, I would probably use steps of 1/4. The adjustment should be something like error/count with a maximum count like 1K. I used a design like this in ZPAQ ICM/ISSE where I packed a 22 bit prediction and 10 bit count into a 32 bit table value. But your tables should be small enough not to bother.
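The stretch/squash pair and the -8..8, step-1/2 quantization described above can be written down directly (a sketch; the class name is made up, and it omits the interpolation between bins):

```java
final class Sse {
    // stretch(p) = ln(p / (1-p)); squash is its inverse
    static double stretch(double p) { return Math.log(p / (1 - p)); }
    static double squash(double x)  { return 1 / (1 + Math.exp(-x)); }

    // Quantize a probability into one of 33 bins: clamp stretch(p)
    // to [-8, 8] and step by 1/2, giving finer resolution near 0 and 1.
    static int bin(double p) {
        double s = Math.max(-8.0, Math.min(8.0, stretch(p)));
        return (int) Math.round((s + 8) * 2);   // 0..32
    }
}
```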

    The other possibility is to use weighted averaging of stretched predictions and adapt the weights like in a ZPAQ mixer. If the predictions are the same, then it is just w1*p1 + w2*p2. If the predictions differ, then you need 2 mixing steps, one for the (A or B) decision bit and one to distinguish them.

  8. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    There's a problem with stretching.

In usual bitwise models there are p(0) and p(1). When they're close to each other, (almost) no compression is achieved, so when stretching we want to give higher weight to probabilities far from 0.5.

In my LZP models there are p(match) and p(miss) probabilities, where p(match) corresponds to a probability estimate for exactly one symbol and p(miss) corresponds to a probability estimate for an effectively unknown number of symbols. For example, when coding a DNA sequence there's an alphabet of size 4 with a rather random distribution, so p(miss) would roughly correspond to the probability sum of three symbols. Therefore it's not obvious which probability ranges should be given higher weight when mixing. Certainly I should prefer the model which gives the higher probability of a match, just like in past TarsaLZP versions with simple switching.

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
Using tables, the only purpose of stretching the inputs is non-uniform quantization. You want finer resolution near 0 and 1. Tables should give you the best results for large files; mixers for smaller or highly non-stationary files.

    You have to code 1 or 2 bits, depending on whether the LZP predictions are the same (which the decoder knows). The first bit says there is a match, and then if the 2 predictions are different you code a second bit to say which one. You have a different table for each case, 3 in all. Each table inputs p1 and p2 and outputs a new prediction p = t[p1][p2]. After each bit is coded, you update the table entry by error/count where error = bit-p. If both LZP predictions are wrong, then you code the byte with exclusions.
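One of those tables might be sketched like this (an illustration under my own assumptions: inputs are pre-quantized bin indices, the count cap is the suggested 1K, and all names are made up):

```java
final class ProbTable {
    static final int Q = 33, MAX_COUNT = 1024;
    final double[][] p = new double[Q][Q];  // combined prediction per bin pair
    final int[][]    n = new int[Q][Q];     // update counts

    ProbTable() {
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++) p[i][j] = 0.5;
    }

    double predict(int b1, int b2) { return p[b1][b2]; }

    // After coding, adjust the entry by error/count, capping the count
    // so the table keeps adapting on non-stationary data.
    void update(int b1, int b2, int bit) {
        if (n[b1][b2] < MAX_COUNT) n[b1][b2]++;
        p[b1][b2] += (bit - p[b1][b2]) / n[b1][b2];
    }
}
```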

    You can replace any of the 3 tables with a mixer: p = squash(w1*stretch(p1)+w2*stretch(p2)). After coding a bit, you update each of the weights, something like w1+=0.001*error*stretch(p1) and likewise for w2 and p2. You could also try adding a bias like p=squash(w1*stretch(p1)+w2*stretch(p2)+w3). In this case, squash and stretch improve compression because they give greater weight to probabilities near 0 and 1 and because you don't need to constrain the weights to keep p between 0 and 1.
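The mixer variant with bias might look like this (the 0.001 learning rate follows the post; the class name and initial weights are my assumptions):

```java
final class Mixer2 {
    double w1 = 0.5, w2 = 0.5, w3 = 0;      // input weights and bias

    static double stretch(double p) { return Math.log(p / (1 - p)); }
    static double squash(double x)  { return 1 / (1 + Math.exp(-x)); }

    // p = squash(w1*stretch(p1) + w2*stretch(p2) + w3)
    double mix(double p1, double p2) {
        return squash(w1 * stretch(p1) + w2 * stretch(p2) + w3);
    }

    // After coding a bit, nudge each weight by rate * error * input.
    void update(double p1, double p2, int bit, double p) {
        double err = bit - p;
        w1 += 0.001 * err * stretch(p1);
        w2 += 0.001 * err * stretch(p2);
        w3 += 0.001 * err;
    }
}
```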

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
1. Obviously it only makes sense to mix the same kind of probabilities.
In this case, the LZP model gives a probability for one specific symbol,
and (I suppose) there's a literal model which provides predictions for all of them.
So the LZP model's prediction for symbol A can only be mixed with the literal
probability of symbol A.
And, well, it's actually simple if we ignore the speed optimization.
PA - LZP model's probability of symbol 'A' occurring (match flag prob.)
p[] - literal model's distribution; p[A] = probability of the same symbol A
p = mix(PA, p[A]) -- new refined probability of symbol A
q[] - mixed distribution, used for actual coding
    Code:
    for( i=0; i<CNUM; i++ ) q[i] = (i!=A) ? p[i]/(1-p[A])*(1-p) : p;
2. Having multiple LZP models in such a case certainly makes it more
complicated, but there's nothing really special.
Basically, we can do the renormalization above twice, for the
two symbols given by the LZP models.
Or maybe even add an extra handler for the case when A==B - mix
3 predictions there (p[A], PA, PB) and only do the renormalization once.

    3. Now, let's consider practical implementations

    a) The approach described above actually can be directly applied
    if we have p[] in the form of a binary tree
    Code:
// c = actual symbol
pp = 1; // probability of the bit prefix coded so far
for( i=0,ctx=1; i<8; i++ ) {
  if( ctx!=((256+A)>>(8-i)) ) break; // bit prefix mismatch
  bit = (c>>(7-i))&1;
  bitA = (A>>(7-i))&1;
  pr = mix(p[ctx],bitA?1-PA/pp:PA/pp);
  bit = rc_Process( pr, bit ); // rc encode/decode
  pp *= bit ? (1-pr) : pr;
  ctx += ctx + bit;
}
for(; i<8; i++ ) { // remaining bits, no LZP prediction left
  bit = (c>>(7-i))&1;
  bit = rc_Process( p[ctx], bit ); // rc encode/decode
  ctx += ctx + bit;
}
b) It's possible to encode the c==A flag separately
(with a mixed probability), followed by masked literal
coding similar to the loops above.
Well, like lzma does.

c) We can make an LZP model which would output 8 flags per
byte (byte prefix match flags) and estimate their probabilities.
In such a case it would turn into plain binary mixing.

d) Like ppmd, we can use symbol ranking and unary coding.
In such a case, it's really simple to fix up the probabilities
of any specific symbols, but unfortunately the "worst case"
behavior of unary schemes is really bad.
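The renormalization from point 1 can be written as a standalone sketch (a minimal illustration; the class and method names are made up, and it assumes p[A] < 1):

```java
final class Renorm {
    // Replace p[A] with the mixed probability and rescale the remaining
    // symbols by (1 - mixed) / (1 - p[A]) so the distribution sums to 1.
    static double[] renormalize(double[] p, int A, double mixed) {
        double[] q = p.clone();
        double scale = (1 - mixed) / (1 - p[A]);
        for (int i = 0; i < q.length; i++)
            q[i] = (i != A) ? p[i] * scale : mixed;
        return q;
    }
}
```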

  11. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
In each LZP slot (there's 1 slot per hashed context) I'm keeping:
- a tag (with checksum and failure count) - irrelevant for modelling considerations,
- an 8-bit state with the usual characteristics, i.e. not having high balanced counts (e.g. 20 matches and 20 misses),
- the last seen byte for that slot.

Those states will then be mapped to probabilities. But as you both said, mixing different kinds of probabilities doesn't make much sense, so I'll probably avoid that. Probably I'll map the 8-bit states from both LZP models combined (into a 16-bit concatenated state) into something different. Maybe a probability/count pair, but maybe better into another kind of state, e.g. a 12-bit one? However, I would have to make a ternary FSM and that would be tricky. A probability/count pair, I'm afraid, would be both too unstable for small counts and too slow for high counts.


In the meantime I'm attaching the current snapshot of TarsaLZP, with ternary coding implemented, but with fixed low-precision hand-tuned probabilities, so there's little point in benchmarking it. I've also reimplemented context confirmation based on my old notes archived in this forum - kudos to encode (and Shelwien?) for maintaining this forum.
Attached Files

  12. #12
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
I've tried a simple approach of combining different LZP predictions using a combined-states mapping, but that proved suboptimal. It seems that either combining different predictions from LZP models doesn't make much sense, or I've overlooked something. Anyway, it's currently worse than the old TarsaLZP versions with the crappy o2 coder, so it's unacceptable.
Attached Files
    Last edited by Piotr Tarsa; 5th February 2012 at 03:18.

