
Thread: Poor compression of bit-version of PPM

  1. #1
    Member
    Join Date
    Mar 2009
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Poor compression of bit-version of PPM

    Hello,

    I've been experimenting with a bit-version of PPM and the results so far
    are very frustrating: book1 -> 254278 bytes. With the byte version I got
    ~236K straight away, and that was a first prototype with almost no
    tweaking. This is a fun project, so I've been reluctant to study in
    depth what other people did, but after quite a bit of frustration with
    the bit-version of PPM I've decided to look around for some explanation
    of why the bit PPM version shows poorer compression. At this stage, I
    just compute the entropy of the model output, so entropy-coding
    deficiency is not an issue. An explanation like "the bit-version covers
    a wider class of sources, therefore it should be a weaker compressor for
    a more specific type of data like ASCII text" is the general explanation
    that I find reasonable, but I'm still missing the _specific_ properties
    of the byte-PPM model that make it more efficient with ASCII text than
    the bit-version of PPM. I would like to avoid being exposed to any open
    (C/C++) sources, thus a general/specific "verbal" explanation would be
    totally sufficient and truly appreciated.

    Stefan

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Isn't bit level unbounded order PPM the same as DMC? In PPM if you have any symbols with zero counts you code an escape symbol and fall back to the next lower context. With just 2 symbols you never code an escape so it is very simple. Hook gets good compression. http://cs.fit.edu/~mmahoney/compression/text.html#1778

    You will notice that the contexts are initialized to start on byte boundaries. When a new context is added, it is by appending the current bit (like bit level LZW), so all new contexts also start on byte boundaries.
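    For illustration, here is a minimal sketch of byte-aligned bitwise contexts; all names, table sizes and update rates below are illustrative choices, not taken from any actual coder. The context for each bit is a couple of whole previous bytes plus the already-seen bits of the current byte (kept with a leading 1), so every new context again starts on a byte boundary:
    Code:
    // Hypothetical sketch of byte-aligned bitwise contexts.
    #include <cstdint>
    #include <vector>

    struct BitModel {
        std::vector<uint16_t> p;   // P(bit=1) per context, 12-bit fixed point
        uint32_t byteHistory = 0;  // last whole bytes
        uint32_t partial = 1;      // bits of the current byte, with a leading 1

        BitModel() : p(1 << 22, 2048) {}   // all contexts start at p = 0.5

        uint32_t context() const {
            // 16 bits of byte history + up to 9 bits of bit prefix,
            // folded into the table (collisions are simply tolerated)
            return (((byteHistory & 0xFFFF) << 9) | partial) & ((1 << 22) - 1);
        }
        uint16_t predict() const { return p[context()]; }

        void update(int bit) {
            uint16_t &pr = p[context()];
            pr += ((bit << 12) - (int)pr) >> 5;   // simple adaptive update
            partial = partial * 2 + bit;
            if (partial >= 256) {                 // a whole byte completed:
                byteHistory = (byteHistory << 8) | (partial & 0xFF);
                partial = 1;                      // realign to byte boundary
            }
        }
    };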

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. PPM/CM models are not very practical unless they show <210k
    on book1 - it's better to use BWT otherwise.
    Also, a bitwise compressor generally has to use byte-aligned
    contexts for compression of texts (and other common data types);
    with unaligned contexts it's better to replace the test file
    with some jpeg or something else with unaligned bitcodes in it.
    And I'd presume that we're talking about real PPM - which tries
    to encode the symbols (bits or bytes, whatever) using statistics
    from a single context, without mixing etc. (Because with mixing
    it's actually hard to get more than 216k.)

    2. Bitwise PPM is actually more complex to implement than a bytewise one,
    as it's much more common with it to get a few contexts of higher order
    than necessary, with random statistics, and determining the optimal
    coding context is quite a problem. Imho the only practical implementation
    of something similar to bitwise PPM is DMC.
    http://en.wikipedia.org/wiki/Dynamic_Markov_Compression

    P.S. Seems that Matt was faster, but I'd post this anyway

  4. #4
    Member
    Join Date
    Mar 2009
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    fuzzy contexts

    >Isn't bit level unbounded order PPM the same as DMC?

    Yes the contexts are not bounded. I tried (long ago) the original Cormack DMC implementation and my recollection is that the compression was not impressive at all, so I didn't even consider it as a contender to start from. Bit-version PPM has a very simple trie organization, and no need for escape probabilities makes it look like a very elegant and simple way to build an efficient context model; maybe it is somehow a variant of DMC with similarly poor compression though ;o(

    Did someone try to build a probabilistic context in such a manner that the context is represented not by 0/1 but by some state in between (say [0...255]) according to the 0/1 probability of the node state? My model allows that; I've tried it and the result was not better, though there are plenty of places where I could have made some stupid mistake. It would be interesting to learn whether someone could get an advantage from using such fuzzy contexts.

    >using statistics from a single context

    It is not single in the sense that each new node of the trie borrows the statistics from the shorter context and updates them accordingly; thus statistics from shorter contexts are accounted for. However, at the moment of 0/1 prediction only the current (longest context on the trie) is used to assess P(0/1).

    > (Because with mixing it's actually hard to get more than 216k)

    Quite encouraging ;o(

    Anyway, I just checked the byte-aligned contexts you advised, and indeed the compression is getting better.

    Thanks,
    Stefan

  5. #5
    Member
    Join Date
    Nov 2008
    Location
    Germany
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts

    DMC vs. PPM

    Quote Originally Posted by Matt Mahoney View Post
    Isn't bit level unbounded order PPM the same as DMC?
    They are in fact very similar, as shown by Bell and Moffat; on the other hand, Bunton has given a counter-example in her PhD thesis showing that they are in fact different. The main idea is the following: you can group edges in DMC into two groups, "redirected edges" (edges going to cloned states) and "cloned edges" (edges coming from cloned states). Now, if one starts with a trivial automaton in DMC and restricts redirection to "cloned edges" only, one indeed gets unbounded PPM, whereas if one (as usual) also allows redirection of "redirected edges", one gets something slightly different.

    To summarize, DMC is the more flexible model (e.g. allowing the use of arbitrary initial automata) on the one hand; but on the other hand, reality gives all its favors to PPM, since due to its simpler structure it allows the application of far more optimization techniques at every level of the implementation (e.g. cutting off branches if free memory is exhausted, fast implementations using hashing, ...).
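    To make the cloning mechanism concrete, here is a minimal DMC sketch after Cormack and Horspool's description; the thresholds and the count-splitting rule are illustrative choices, not taken from any particular implementation:
    Code:
    // Minimal DMC state-cloning sketch (illustrative).
    #include <vector>

    struct State {
        int next[2];      // successor state for bit 0 / bit 1
        double count[2];  // transition counts, used for prediction
    };

    std::vector<State> states;

    // Predict P(bit = 1) in state s.
    double p1(int s) {
        const State &st = states[s];
        return (st.count[1] + 0.5) / (st.count[0] + st.count[1] + 1.0);
    }

    // Follow the transition for 'bit', cloning the target when this edge
    // carries enough traffic relative to the target's other incoming edges.
    int transition(int s, int bit) {
        int t = states[s].next[bit];
        double edge  = states[s].count[bit];
        double total = states[t].count[0] + states[t].count[1];
        if (edge > 2.0 && total - edge > 2.0) {   // illustrative thresholds
            State clone = states[t];              // clone keeps the successors
            double r = edge / total;              // split counts by edge usage
            clone.count[0] = states[t].count[0] * r;
            clone.count[1] = states[t].count[1] * r;
            states[t].count[0] -= clone.count[0];
            states[t].count[1] -= clone.count[1];
            states.push_back(clone);
            t = states[s].next[bit] = (int)states.size() - 1;
        }
        states[s].count[bit] += 1.0;
        return t;
    }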

    Alibert

  6. #6
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    It is not fully optimized, but with a simple BWTS using straight two-state binary coding - so that it's a true binary compressor, no byte alignment - book1
    compresses to 235,913 bytes.

    http://bijective.dogma.net/compresa.htm#iBBW

    Some day I will clean it up for better compression.
    But this at least gives you another point of comparison, where a bit BWT
    can be compared to a bit PPM.

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Yes the contexts are not bounded.

    Then I'd advise getting it working with bounded contexts
    first. Actually I'm not aware of a PPM implementation
    which could properly handle unbounded contexts... their
    compression usually gets worse when you try to increase
    the context order past some point (depending on the file).

    Code:
    book1
    order ppmd_vJ vJ_sh5 ppmy_3c 3c+SSE
     4    216021  216065 216527  215707
     5    210767  210711 209697  209401
     6    209844  209597 207793  207654
    ppmd_vJ http://compression.ru/ds/ppmdj1.rar
    ppmd_vJ_sh5 http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh5.rar
    ppmy_3c http://compression.ru/sh/ppmy_3c.rar
    ppmy_3c+SSE http://compression.ru/sh/ppmy_3c_sse.rar

    > I tried (long ago) the original Cormack DMC
    > implementation and my recollection is that the compression
    > was not impressive at all, so I didn't even consider it as
    > a contender to start from.

    Yeah, but it's old... and other implementations might be
    tuned for non-textual data...
    But there's not much choice if you want a bitwise PPM:
    1. Use bitwise statistics and coding, but bytewise escapes.
    2. Only add the "proven" contexts to the tree (the DMC way).
    3. Keep all statistics but discard "unworthy" higher orders (LOE etc.);
    a sketch of this follows below.
    Also this actually applies to bytewise PPM as well, but there
    it's much more probable that the context with maximal order would
    also give the best predictions (for stationary data at least).
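    As an illustration of option 3, a much-simplified "local order estimation" could look like this (thresholds are made up for the sketch):
    Code:
    // Pick the coding order among the matched contexts (illustrative).
    struct Ctx { int n0, n1; };   // bit counts of one matched context

    // ctx[0] is order 0, ctx[top] the longest matched context;
    // returns the order whose statistics look trustworthy enough.
    int chooseOrder(const Ctx *ctx, int top) {
        for (int o = top; o > 0; --o) {
            int n = ctx[o].n0 + ctx[o].n1;
            bool deterministic = (ctx[o].n0 == 0 || ctx[o].n1 == 0);
            // trust a high order only with enough history, or when it
            // was never contradicted and was seen at least twice
            if (n >= 4 || (deterministic && n >= 2))
                return o;
        }
        return 0;   // fall back to order 0
    }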

    > Bit-version PPM has a very simple trie organization

    Use hashes if you want it simple - it's kind of natural with
    bitwise statistics, and any tree is more complicated than
    a hash, especially when you try to implement something
    like a sliding window with it.
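    For example, a hashed bit-context table can be as simple as this sketch (everything here is illustrative, and hash collisions are simply tolerated):
    Code:
    // Hashed bitwise statistics: one pair of small counters per slot.
    #include <cstdint>

    struct Counter { uint8_t n0 = 0, n1 = 0; };

    const uint32_t SIZE = 1 << 22;   // 4M slots
    Counter table[SIZE];

    uint32_t hashCtx(uint64_t ctxBytes, uint32_t bitPrefix, int order) {
        uint64_t h = ctxBytes * 0x9E3779B97F4A7C15ull
                   + bitPrefix * 0x2545F491ull + (uint64_t)order;
        return (uint32_t)(h >> 40) & (SIZE - 1);
    }

    double predict(const Counter &c) {
        return (c.n1 + 0.4) / (c.n0 + c.n1 + 0.8);
    }

    void update(Counter &c, int bit) {
        uint8_t &hitN  = bit ? c.n1 : c.n0;
        uint8_t &missN = bit ? c.n0 : c.n1;
        if (hitN < 255) ++hitN;
        if (missN > 2) missN = 2 + (missN - 2) / 2;  // age the opposite count
    }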

    > and no need for escape probabilities

    Escapes would still be useful for deterministic contexts
    (where only one bit value was encountered).

    > Did someone try to build a probabilistic context in such
    > a manner that the context is represented not by 0/1 but by
    > some state in between (say [0...255]) according to the 0/1
    > probability of the node state?

    Here it sounds more like using the statistics as a context
    than a fuzzy context; but if you mean that, then it's common enough,
    known as SSE, APM, or a quantized history mapping.
    But that's actually different from my idea of a "fuzzy context",
    which imho should be either a quantized context (by coding with
    overlapped intervals, for example), or a mix of all contextual estimations
    with weights determined by similarity.

    > however at the moment of 0/1 prediction only the current
    > (longest context on the trie) is used to assess P(0/1).

    Well, the worst case for such a model would be a dictionary file
    (like english.dic) - actually PPM and BWT have the same problem
    with it, but it would be even worse if you just used the maxorder
    context for a bit.
    I mean, words have a lot of common suffixes, which PPM would use
    as a context for the next word... but the statistics would always be
    completely wrong if it's a dictionary.
    And though less frequently, the same happens with plaintext
    as well - it's quite common to get "accidental" high-order contexts,
    which provide random predictions, which is especially bad
    for PPM.

    >> (Because with mixing it's actually hard to get more than 216k)
    > Quite encouraging ;o(

    Well, it's a fact: last time I implemented a tree for bytewise
    statistics (plain counts of context occurrences), and when
    a dummy model was added (a linear mix of contextual probability
    distributions, with fixed weights), right away it was 215.7k or something.
    http://ctxmodel.net/files/tree_v1.rar
    Also, ppmy is not much smarter than that either.

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Some results for book1:

    dmc -> 237997
    hook 1.3 -> 232857

    However, hook uses LZP preprocessing.

  9. #9
    Member
    Join Date
    Mar 2009
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    You are right, on book1 DMC shows 237997; actually quite good. I tested it long ago on a mixed-content tar file, compared it with RK (not sure), and DMC did not look impressive at all. Apparently I did a lousy job; thanks,
    Stefan

  10. #10
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    I picked up my experimenting again. The things I read here are also valid for the piece of algorithm I am constructing.
    Currently it's doing ~208000 on book1 and 436000 on World95.txt, so I'm happy with it.
    What I did to mix the order X 'hits' of the unbounded PPM together is this:
    Code:
    sum = 0
    divider = 0

    for every order in current hits
    {
        e := getAverageErrorFor(order)  // average error of this hit, taking the
                                        // bit index, last byte and order into
                                        // account; 0 <= e <= 0.5
        sum = sum + stretch(confidence(p1, e))  // p1 = more or less n1 / n
        divider = divider + (1 - (e * 2))
    }

    p = squash(sum / divider)

    output = APM(p)  // made myself an APM, not using the PAQ version, but I
                     // call it APM here so everybody knows what it is

    confidence(p1, e) is p1 adjusted for e: it returns 0.5 when e = 0.5, and
    p1 unchanged when e = 0.
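    For reference, the stretch()/squash() pair above is presumably the usual logistic pair from paq; in floating point it is just:
    Code:
    // Logistic stretch/squash (as in paq), floating-point form.
    // p must be strictly inside (0, 1).
    #include <cmath>

    double stretch(double p) {    // probability -> logit domain
        return std::log(p / (1.0 - p));
    }
    double squash(double x) {     // logit domain -> probability
        return 1.0 / (1.0 + std::exp(-x));
    }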

    Right now I am struggling for improvement. It feels like calculating e and defining the counter statistics is the place to look for improvements, but for 2 days now I have been unable to improve things.
    If you count statistics, what do you use?

    PPMonstr surprised me. Book1 results in 202700 using PPMonstr. What's the trick there? Why is it so much better than mine?

    I know that my method is some sort of weighted average in logistic space, but it works fine, because sadly my counter statistics by themselves do not lead to useful probabilities. Are there any other methods?

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Currently it's doing ~208000 on book1 and
    > 436000 on World95.txt so I'm happy with it.

    That's better already - it starts making sense to use CM instead of BWT.

    > What I did to mix the order X 'hits' of the unbounded PPM together is this:
    > [...]
    > p = squash(sum / divider)
    > output = APM(p)
    > Made myself an APM, not using PAQ version, but I call it
    > APM here so everybody knows what it is.

    Somehow this looks like a bitwise CM, or am I misunderstanding something?
    PPM is the algorithm where you use a single context for encoding, with
    special "escape" symbols for switching contexts.

    > If you count statistics, what do you use?

    paq, ccm, m1 use state machines with an 8-bit state for context history
    representation.
    ppmonstr/ppmd use plain symbol occurrence counters (effectively 5-bit afair).
    And then there's a "linear counter", with an update like x = x*r + bit*(1-r).
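    A fixed-point sketch of that linear counter; the rate below is an arbitrary illustrative choice, and an arithmetic right shift is assumed:
    Code:
    // Linear counter x = x*r + bit*(1-r) with r = 1 - 2^-5,
    // kept as P(1) in 16-bit fixed point.
    #include <cstdint>

    struct LinearCounter {
        uint16_t x = 32768;                      // start at p = 0.5

        void update(int bit) {
            x += ((bit << 16) - (int)x) >> 5;    // x += (bit - x) / 32
        }
        double p1() const { return x / 65536.0; }
    };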

    > PPMonstr surprised me. Book1 results in 202700 using PPMonstr.

    btw, ash04a's result is 202932 (and 427355 for world95), and its source is available.

    > What's the trick there? Why is it so much better than mine?

    1. bytewise models are better for text compression -
    texts are not generated as a sequence of bits, and bytewise
    probability estimation seems to be more precise there.
    Even the paq result with the word model disabled was around 204k afair.
    2. nonstationary counters which give more weight to recent samples.
    3. recognition of "deterministic" contexts.
    4. unary coding of byte values with SSE and additional contexts.

    Well, there are two main directions for improvement:
    1. contextual analysis - e.g. you can store longer context histories
    to files and compare the performance of your estimator to paq8
    (or whatever produces the best results for such sequences).
    2. weighting - there are some "common sense" dependencies in text;
    like, any symbol belongs to one specific context, and mixer weights
    are the probabilities of a given context being the right one.
    In other words, the mixer is a secondary model which computes estimations
    using the primary estimations as context - there's no requirement
    to independently calculate some "prediction error" for each context
    or anything like that.

    Also, APM/SSE is not a "black box" either - again, it's a secondary
    model. So it's reasonable to use other contexts for it besides probabilities -
    like the quantized number of different symbols in the maxorder context, or
    even the plain current order, or something like the number of sequential
    successful guesses.
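    To make the "secondary model" idea concrete, here is a minimal SSE/APM sketch (entirely illustrative): a table indexed by a side context and the quantized, stretched input probability, with interpolation between adjacent buckets:
    Code:
    // Minimal SSE/APM: refine a probability using a small side context.
    #include <cmath>
    #include <vector>

    struct SSE {
        static const int BUCKETS = 33;
        std::vector<double> t;     // refined P(1) per (context, bucket)
        int lastIdx = 0;
        double lastW = 0;

        explicit SSE(int contexts) : t(contexts * BUCKETS) {
            for (int c = 0; c < contexts; ++c)
                for (int i = 0; i < BUCKETS; ++i)   // near-identity init
                    t[c * BUCKETS + i] = 1.0 / (1.0 + std::exp(-(i - 16) / 2.0));
        }

        // p must be strictly inside (0, 1).
        double refine(double p, int ctx) {
            double s = std::log(p / (1.0 - p));     // stretch
            double pos = std::min(std::max(s / 2.0 + 16.0, 0.0), 31.999);
            int i = (int)pos;
            lastW = pos - i;                        // interpolation weight
            lastIdx = ctx * BUCKETS + i;
            return t[lastIdx] * (1.0 - lastW) + t[lastIdx + 1] * lastW;
        }

        void update(int bit, double rate = 0.02) {  // train both buckets
            t[lastIdx]     += (bit - t[lastIdx]) * (1.0 - lastW) * rate;
            t[lastIdx + 1] += (bit - t[lastIdx + 1]) * lastW * rate;
        }
    };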

  12. #12
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Again I have some problems making myself understood. You are right about the CM part. I started to make a PPM version, making use of orders 0 to 255, but using the escape symbol did not result in good compression. Just like the person who started the topic.

    My algorithm does not use hashes, but for every new byte it scans the entire history for matches of a certain length. (Yes, it is very slow, but I would like to know if it would help me to create a compressor with superior results.) The process is optimized as far as I could manage: I went from compressing book1 in a day to book1 in 10 minutes.

    The second version is able to scan for orders 0 to 65535, but here the match length isn't in bytes but in bits, so I have matches of a certain bit-length.

    Having said that, I discovered that for text files using bit-length matches on byte boundaries is very useful. The number of errors in the prediction decreases, but it is increasingly hard to find a good mix for optimal entropy. The entropy of the byte version is currently better compared to the bit version. A combination of these 2 'models' produces the results from my last posting.

    To get the current compression, I maintain a 2-dimensional array of counters for every byte-wise prediction. This array contains the counters for the 256 possible bytes for every possible order. For the bitwise CM-like prediction I sort of convert this array to bitwise form to get, for every bit of the 8 bits, the c0 & c1 counters for a specific order, and use this as the input described in my last posting.

    I have done several experiments with gaps and other alternative 'matches', but that results in a million statistics where it is very difficult to say what's actually useful and what to ignore. Only the exact matches proved to be useful in almost every case. I wrote code to use book1 as input and select a random list of pattern-like contexts in order to find, evolution-style, the optimal set of patterns to use for compression. Almost all the alternative stuff lost out to plain old match-finding.

    Making use of the comments in Paq I am able to read and understand most of the source code. Shelwien, I tried to read some of your code, but I have no scientific background and you use a lot of shortcuts and abbreviations in your code. They all make sense, but not directly to me. I know that it is a gap in my knowledge. For example: x=x*r+bit*(1-r)
    What's x and what's r? I guess r = learning rate? x = the new counter value?

    My counters have been nonstationary for months now. That improved compression a lot. That's what I have learned from you, Shelwien.
    What I have discovered is that, for example, a count of 1 occurrence is close to useless for order < 4 and sometimes, but not always, useful for order > 4. In that case I use an error rate that decreases p = c1 / (c0 + c1) to a prediction which is closer to the truth. You could call that the escape symbol from PPM. I fill the error part of the prediction with values found in a lower order, where in fact this process repeats itself.
    The error part sounds useless and crappy and I agree with that, but it helps me get better predictions. Advice on this is welcome.

    Could you explain some more about "deterministic" contexts and SSE? Unary coding? Using Google does not result in useful help.

  13. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    really? the first hit from google was http://en.wikipedia.org/wiki/Unary_coding

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I have done several experiments with gaps and other
    > alternative 'matches',

    Normally it's only usable for specific data types,
    like x86 code, where repeatable patterns consist of
    some fixed bytes and some variable offsets and constants.
    But still, ppmonstr has a sparse submodel like that,
    which is also effective on some texts, including that world95.
    The output of that submodel is only used as a flag in
    an SSE context, though.

    > For example: x=x*r+bit*(1-r)
    > What's x and what's r? I guess r = learning rate? x =
    > the new counter value?

    Yeah, it's explained in more detail here - http://www.ctxmodel.net/rem.pl?-2

    > The error part sounds useless and crappy and I agree with
    > that, but it helps me get better predictions. Advice
    > on this is welcome.

    Well, one famous (and working) method for context weighting
    is CTW: http://en.wikipedia.org/wiki/Context_tree_weighting
    Though it might be more practical to just use a paq mixer
    with an ad hoc context.

    Then, there's also my ppmy:
    http://compression.ru/sh/ppmy_3c.rar
    http://compression.ru/sh/ppmy_3c_sse.rar
    It uses plain freqs without rescaling, but still shows
    somewhat better results than yours (~206.9k).
    The weighting formula there uses escape freqs and "optimal"
    freqs (the number of cases where a given context provided the
    best prediction) to compute the context weights.

    > Could you explain some more about "deterministic" contexts

    In ppmd/ppmonstr/ppmy/ash these are contexts where
    only one symbol has appeared (Shkarin calls them "binary" contexts though,
    because there's that symbol plus escape in the alphabet).
    In practice, such contexts exist for something like 50% of symbols in text,
    so precise probability estimation in these cases is really important.
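    As a toy example of why the estimate matters there: if the single seen symbol occurred n times and nothing else was ever coded, a simple estimator could be
    Code:
    // Toy estimator for a deterministic context: the single seen symbol
    // gets probability n/(n+e), where the constant e plays the escape
    // role (0.4 is an arbitrary illustrative value).
    double detProb(int n, double e = 0.4) {
        return n / (n + e);
    }
    so the probability approaches 1 only slowly with n - and a small change in e visibly changes the code length for roughly half the symbols in the file.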

    > and SSE?

    SSE is Secondary Symbol Estimation, by analogy with SEE =
    Secondary Escape Estimation; basically a probability mapping.
    You should be able to understand how it's used by comparing
    ppmy with ppmy+sse.

    > Unary coding?

    Well, it's bitwise coding basically, but instead of encoding
    symbol bits we encode flags like (c=='a'), then (c=='b'), etc.
    Of course, it would be impractical to just encode such flags
    starting with \x00, so actual implementations (ppmd etc.)
    use some kind of symbol ranking.
    Btw, SSE is most effective when used with unary coding.
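    A sketch of what that can look like (illustrative only: encodeFlag() stands for an arbitrary binary arithmetic coder call, the rank list is assumed to contain every byte value, and a real coder would refine each flag probability with SSE):
    Code:
    // Unary coding of a byte with move-to-front symbol ranking.
    #include <cstdint>
    #include <vector>

    void encodeFlag(int bit, double p1);   // assumed binary coder, not defined

    void encodeSymbol(uint8_t sym, std::vector<uint8_t> &rank) {
        // rank holds all 256 byte values, most recently seen first
        for (size_t i = 0; i < rank.size(); ++i) {
            int hit = (rank[i] == sym);
            double p1 = 1.0 / (i + 2);     // crude rank-based estimate
            encodeFlag(hit, p1);
            if (hit) {                     // move-to-front update
                rank.erase(rank.begin() + i);
                rank.insert(rank.begin(), sym);
                return;
            }
        }
    }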
    Last edited by Shelwien; 7th January 2010 at 17:53.

  15. #15
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    really? the first hit from google was http://en.wikipedia.org/wiki/Unary_coding
    I meant this part: "4. unary coding of byte values with SSE and additional contexts".
    Can't read that on Wikipedia.

  16. #16
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Shelwien,

    It will take me some time to read, understand and experiment with the given advice and code.

    Thanks for your answers.

  17. #17
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    First try: 234kb with a very basic escape mechanism.
    There are just 2 counters for every order, measuring escapes and successes.
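    That amounts to a minimal SEE: per order, the escape probability is estimated from those two counters, roughly like this sketch (the smoothing constants are illustrative):
    Code:
    // Per-order escape estimator matching the description above.
    struct OrderStats { int escapes = 0, successes = 0; };

    double pEscape(const OrderStats &s) {
        return (s.escapes + 0.5) / (s.escapes + s.successes + 1.0);
    }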

  18. #18
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Found a way to correctly implement the paq mixer. Doing <203k on book1 and <414k on world95.txt.
    When I find a way to improve world95.txt to <410k I will concentrate on speed optimization.

  19. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Sounds interesting.
    So what is it? Another bitwise coder with hashtables and a "word model" like paq?
    Or a bytewise unary coder with SSE? (but you didn't mention SSE...)
    Somehow I don't think that 203k on book1 is possible with only symbol counts
    and logistic mixing - there has to be something else :)
    ...Also, did you check decompression? :)

  20. #20
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Currently it is a bitwise 'match-searcher' with byte boundaries: 2 different counters, currently 6 main paq-mixers, and extensive use of APM. No hashtables or word model yet. 1 counter alone still manages to get <203k on book1. The second one is a faster-adapting version, usable for very non-stationary files like world95.txt. The latest version is currently doing 202600 on book1 and 411853 on world95.txt.

    But I partially agree with you, Shelwien. It is starting to look like a stripped-down Paq version, written in .Net. It is also very slow: world95.txt takes 6 hours to complete.

    The next step is to use hashtables/trees to speed up match finding and to convert the floating-point operations to integer. After that it will be hard to say that I made something unique. It works great, but it is more like paq revisited.

  21. #21
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    A little update: compression is now 201800 on book1 and 409700 on world95.txt. That should be good enough.
    Speed is also getting better: world95.txt currently takes 3 hours to compress. I still haven't implemented hashes and integer math, so I expect loads of room for speed improvement.

    Only 1 model, with counters for stationary and non-stationary data... I don't think much more improvement in compression is possible?
    (I want to stick to 1 model for the moment.)
    Last edited by Kaw; 16th March 2010 at 17:04.

