
Thread: PPMX v0.02 is here!

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts

    PPMX v0.02 is here!

    Still quite a draft, but it has a more serious order-9 PPM model.

    This version has a special build for the Intel Core 2 Duo CPU - by utilizing some newer CPU features, PPMX runs faster!
    Attached Files

  2. #2
    The Founder encode's Avatar
    Code:
    ENWIK8 -> 22,580,291 bytes (1.806 bpb)
    ENWIK9 -> 194,298,469 bytes (1.554 bpb)
    bookstar -> 9,613,763 bytes (2.161 bpb)
    book1 -> 227,280 bytes (2.365 bpb)
    world95.txt -> 493,122 bytes (1.320 bpb)
    fp.log -> 625,450 bytes (0.243 bpb)
    3200.txt -> 3,927,886 bytes (1.962 bpb)
    bible.txt -> 764,447 bytes (1.511 bpb)
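    For reference, the bpb figures above follow directly from the byte counts: bits per byte = 8 × compressed size ÷ original size (ENWIK8 and ENWIK9 are 10^8 and 10^9 bytes respectively). A minimal check, as a sketch:

```cpp
#include <cassert>
#include <cmath>

// bits per byte = 8 * (compressed size) / (original size)
double bpb(long long compressed, long long original) {
    return 8.0 * compressed / original;
}
```

    E.g. bpb(22580291, 100000000) gives about 1.806 for the ENWIK8 row above.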

  3. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    It's a bit stronger, but almost twice as slow (on bookstar)...
    ~61 s vs 33.422 s

  4. #4
    The Founder encode's Avatar
    For comparison, performance on my machine (Core 2 Duo @ 2.40 GHz, 2 GB RAM, Windows XP SP2):

    bookstar
    PPMX v0.01 - 18 sec
    PPMX v0.02 - 31 sec
    PPMX v0.02CORE2DUO - 26 sec


  5. #5
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts


    Thanks Ilia!

  6. #6
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Could you post some technical information about your current scheme? I'm just wondering

  7. #7
    The Founder encode's Avatar
    Quote Originally Posted by osmanturan View Post
    Could you post some technical information about your current scheme? I'm just wondering
    Quite a baseline PPM: the max model order is 9, with symbol masking (full exclusion); no SEE, SSE, etc. I use hashing, as you noticed before. The escape strategy is different for each model order - a PPMD-like variant for the highest orders and a PPMC-like one for the lowest...

  8. #8
    Programmer osmanturan's Avatar
    Quote Originally Posted by encode View Post
    Quite a baseline PPM: the max model order is 9, with symbol masking (full exclusion); no SEE, SSE, etc. I use hashing, as you noticed before. The escape strategy is different for each model order - a PPMD-like variant for the highest orders and a PPMC-like one for the lowest...
    Hmm... thanks! I find it a bit redundant, and I think the main reason could be the hashing (which also slows down your coder). Do you have any ideas? And do you plan to implement a tree-like approach? Or, simply, what's next?

  9. #9
    The Founder encode's Avatar
    Quote Originally Posted by osmanturan View Post
    And do you plan to implement a tree-like approach?
    I started with a tree... But hashing is far more flexible - I can use any models with any order, and additionally I can skip some "redundant" orders. In addition, there's no need for model rebuilding or resets!
    Quote Originally Posted by osmanturan View Post
    Or, simply, what's next?
    I'm experimenting with many PPM tricks - SEE, II (Information Inheritance), new counters, etc.
    I want extra compression (to finally beat PPMd) at the same or even faster speed. And it is possible. I already made a stronger version - with a more advanced escape strategy - which runs at the same speed, but the overall difference is quite small, so I kept the basic variant: for me it's easier to add real SEE to such a scheme than to rewrite parts of the code...

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > > And do you plan to implement a tree-like approach?
    > I started with a tree...

    Really? And what happened?

    > But hashing is far more flexible -

    I don't see how it's more flexible.
    It's just very simple to write.
    And even then, only for bitwise models.

    Although in fact, there won't be much
    difference from a tree once you (finally) start
    using variable-length context nodes.
    The main difference is that you'd still have
    to compute the hash functions, while with a tree
    it's just a matter of following a pointer.

    > I can use any models with any order,

    Which is quite possible with a suffix tree as well.
    You can store the "ode" context's address in the suffix
    pointer of the "encode" context, and there you have
    your escape from o6 to o3.
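    The suffix-pointer idea can be sketched as a plain struct (a hypothetical illustration of the layout, not ppmd's actual node format):

```cpp
#include <cassert>
#include <cstring>

// A context node with a suffix link: escaping from the order-6 context
// "encode" to the order-3 context "ode" is one pointer dereference,
// with no hash computation at all.
struct ContextNode {
    const char*  ctx;     // context string this node represents (for clarity)
    ContextNode* suffix;  // lower-order context reached on escape
};
```

    For example, with ContextNode o3{"ode", nullptr}; ContextNode o6{"encode", &o3};, following o6.suffix lands directly on the o3 node.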

    > additionally I can skip some "redundant" orders.

    There's the matter of masked contexts, though, but that'd
    only require some intermediate structures, like
    suffix pointer arrays or something.
    And only for the cases where a lower-order context
    has some bits of data which are removed from a
    higher-order context.

    Anyway, it's not really necessary to regard this
    structure as a tree... it's just a set of context
    nodes with links between them, which allow you to
    incrementally (with a single pointer lookup per symbol)
    reach all the context nodes while parsing the data.

    Technically, a hash has the benefit of not keeping
    these pointers, but in exchange it has to be very
    redundant to provide good speed.

    Another possible benefit is the ability to
    access a symbol's nodes separately, thus
    significantly simplifying allocation,
    but again, that would totally slow down any normal
    bytewise model.

    > In addition, there's no need for model rebuilding or resets!

    That's really the problem.
    But it's just a matter of more programming.
    However, model reset doesn't actually require any programming,
    and model cutoff can be implemented by reparsing a block
    of data after resetting the stats (I think that would be
    faster, seeing how much time ppmd's tree processing with
    -r1 takes).
    And then, it's also easy enough to implement a sliding window
    (though there are doubts about the performance of that approach).

    In any case, like this you always know what kind of statistics
    you have and why, which is not really possible with a hash.
    Last edited by Shelwien; 3rd December 2008 at 01:09.

  11. #11
    The Founder encode's Avatar


    As you mentioned before, it's not so easy to deal with trees... At some point I could end up spending more time fighting tree structures than experimenting with actual PPM modeling.
    With my hash tables I can change one digit to choose any model order, easily skipping from order-7 to order-3, or from order-12 to, say, order-5... Just one click! I bet it wouldn't be so easy with a suffix tree... That's what I call flexibility!
    Anyway, I think I should keep my eye on actual modeling - say, on things like counters, or finally a proper SEE implementation.
    In other words, I'm too lazy to fight with complex structures and weird memory allocation... ...and prefer to spend some time reinventing the wheel!
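    The "change one digit" flexibility is easy to see in a sketch: hash the last k bytes of the history, and the model order is just the value of k. (FNV-1a here is only an example hash function, not PPMX's actual one.)

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hash of the last `order` bytes of the history; switching from order-7 to
// order-3 is literally changing one number - no tree restructuring needed.
uint32_t context_hash(const std::string& history, int order, int table_bits) {
    uint32_t h = 2166136261u;                        // FNV-1a offset basis
    int n = (int)history.size();
    for (int i = n - order; i < n; ++i) {
        if (i < 0) continue;                         // history shorter than order
        h = (h ^ (uint8_t)history[i]) * 16777619u;   // FNV-1a prime
    }
    return h & ((1u << table_bits) - 1);             // index into the hash table
}
```

    Two histories ending in the same 3 bytes map to the same order-3 slot, regardless of what came before.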

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    http://cs.fit.edu/~mmahoney/compression/text.html#1943

    Version 0.02 compresses much better, but is slower. The Core 2 Duo version runs faster on my new computer (Pentium T3200 dual core, 32-bit mode) than the regular version, but the regular version has a smaller .exe, so I tested with that one. I tested both on enwik8, and both run on only one core.

    For a speed comparison, see lpaq9l, which I tested on both my old and new computers. The new one is about 10% faster on one core.

  13. #13
    Administrator Shelwien's Avatar
    > At some point I could end up spending more time fighting tree structures
    > than experimenting with actual PPM modeling.

    Well, I got stuck at that stage myself, as there's no known "perfect"
    implementation of a "statistical parser".
    But it's like that for a good reason too - the model design is highly
    affected by the kind of statistics available.
    For example, the PPM and CM approaches work with completely different
    statistics, and the choice to use nonstationary counters
    instead of direct occurrence counts basically blocks the
    possibility of a sliding-window implementation.

    > With my hash tables I can change one digit to choose any model order,
    > easily skipping from order-7 to order-3, or from order-12 to, say, order-5...
    > Just one click! I bet it wouldn't be so easy with a suffix tree...

    Well, you could do just that with a tree as well.
    And with the kind of memory consumption you have,
    it'd be pretty much OK to keep the full statistics for all
    orders, but just skip some of them.
    The only real complication would be with non-uniform
    contexts (masked or sparse or constructed in some other way),
    which you don't use yet anyway.
    Though I guess it's always possible to use hashes for
    some submodels where they're really convenient
    (like models predicting a single symbol).

    > That's what I call flexibility!

    That's OK for now, I guess, but somehow I think that to beat
    ppmd in speed you'd need a tree anyway, and then building
    a tree with exactly the same properties as your hash system
    would be really problematic.
    That's basically why I don't try to skip that stage myself.
    And I did try out the bitwise approach too (in o6mix), just
    to make sure, and it still seems that it doesn't really
    work as a workaround for bytewise processing.
    Of course we have some good examples of how competitive
    bitwise models are in some range of the speed/compression/memory
    balance, but imho it's also obvious that they're quite
    close to the limit there, leaving very little room
    for improvement. With something like ppmd it's
    not obvious at all, considering fast compression, and
    even less obvious with ppmonstr, considering high compression.

    However, the bytewise approach is itself problematic too, as
    larger alphabets are readily available (like unicode, or
    numbers, or words in text). It's still much
    better than bitwise in those cases, but I don't have
    even the slightest idea how to implement a real word-wise
    model (it's not only a matter of counting word occurrences
    and encoding a word from a dictionary; the main problem is
    that probability estimation, updates, inheritance and such
    all have to be implemented in some unknown but complex way).

    > Anyway, I think I should keep my eye on actual modeling - say,
    > on things like counters, or finally a proper SEE implementation.

    So, did you already try finding a good example of a context history
    and separately (from PPM) developing a model for it?

    In fact, that's the easiest and most convenient way to do it
    (dumping a context history into a file and making a standalone
    compressor for it), as you then don't have to care
    about a lot of stuff (of course, all the extra info, like
    contexts, has to be dumped as well).
    Btw, escapes can be replaced with some unused symbol, and
    it's also useful to store a lot of context histories
    with terminators in a single file, and process that file
    with a single simple model that flushes on those terminators.
    Like that you can even get the same output size as with
    a full compressor, just processing the histories sequentially
    instead of in parallel, and without having to care about
    memory management and such at the same time.
    It's also possible to use other compressors for estimation.
    Anyway, that's how I do it.

    > In other words, I'm too lazy to fight with complex structures
    > and weird memory allocation... ...and prefer to spend some time
    > reinventing the wheel!

    Of course it's your choice; just remember that any model really depends on
    the kind of available statistics, and that is affected by the
    data structure used - like counter size, or the availability of context
    totals, or hash collisions.
    Thus, especially if it's a fast model (= simple and not very adaptive),
    it's better to always remember that it'd probably be almost
    unusable when attached to any other source of statistics.

  14. #14
    Administrator Shelwien's Avatar
    @Matt Mahoney:

    Btw, I always wanted to ask: did/do you perform any analysis on
    context histories stored in separate files?
    And were all these numeric constants found manually or
    automatically?

  15. #15
    The Founder encode's Avatar
    Quote Originally Posted by Matt Mahoney View Post
    Thanks Matt!

    Quote Originally Posted by Shelwien
    So, did you already try finding a good example of a context history
    and separately (from PPM) developing a model for it?
    To be honest, I've never tried it. I tune and play with all the model parameters on the fly, within the PPM model itself. I think I'll try it at some stage!

    Quote Originally Posted by Shelwien
    The only real complication would be with non-uniform
    contexts (masked or sparse or constructed in some other way),
    which you don't use yet anyway.
    BTW, I'm thinking about sparse and word models. In PAQ terms: apart from the standard N-gram models, I could also use sparse and word models. But where PAQ does mixing - i.e. uses predictions from ALL models simultaneously - with my PPM I have to choose an exact model sequence, say:
    word model
    order-7 model
    order-5 model
    sparse model
    order-3 model
    etc.
    i.e. if a model can't predict, it emits an escape symbol and moves on to the next model. Additionally, I could use a kind of LOE heuristic (invented by Alexander Loe ) to choose the best initial model. This would be really interesting!
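    The model sequence above can be sketched as a simple escape chain (a hypothetical interface: each model just reports whether it can predict the symbol; a real coder would also encode each escape and the symbol itself):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Try models in a fixed sequence; each model either codes the symbol or
// emits an escape and falls through to the next one. Returns the index of
// the first model able to predict, or -1 (order -1 / uniform fallback).
int first_predicting_model(const std::vector<std::function<bool(int)>>& models,
                           int symbol) {
    for (size_t i = 0; i < models.size(); ++i) {
        if (models[i](symbol))   // true: this model has statistics for symbol
            return (int)i;       // code the symbol with this model
        // false: emit an escape and fall through to the next (lower) model
    }
    return -1;                   // every model escaped
}
```

    A LOE-style heuristic would replace the fixed starting point with a per-symbol guess at the best initial model.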

  16. #16
    Administrator Shelwien's Avatar
    Still, there are some doubts about the compatibility of non-ordered
    submodels with escaping. Certainly LOE would be useful, but
    even with it, really good results are not guaranteed.
    Well, I guess it's still worth experimenting.

    Btw, for reference, the only PPM with non-ordered submodels
    known to me is ppmonstr; the sparse submodel there
    predicts the next symbol instead of a probability, and
    a comparison with that symbol is used in the SSE contexts.

  17. #17
    The Founder encode's Avatar
    BTW, can you explain the difference between PPMDet (where we keep only one symbol per context) and unary coding (PPMonstr-style)?

  18. #18
    Administrator Shelwien's Avatar
    The unary coding loops in ppmonstr look nearly the same as those in ppmd.
    The only difference is that the symbol's probability is adjusted.

    Code:
    s = total;
    for( i=0; i<NSymbols; i++ ) {
      p0 = freq[i]/s;
      s -= freq[i]; // mask the processed symbol
      p = SSE( p0 );
      flag = (c==symbol[i]);
      rc.Process( flag, p );
      if( flag ) break;
    }
    And btw, it's not necessary to have an rc call per symbol,
    but then some interval operations similar to the rc's have to be
    used anyway.

  19. #19
    Expert
    Matt Mahoney's Avatar
    Quote Originally Posted by Shelwien View Post
    @Matt Mahoney:

    Btw, I always wanted to ask: did/do you perform any analysis on
    context histories stored in separate files?
    And were all these numeric constants found manually or
    automatically?
    No, it was easier to experiment with the models in the compressor. The numeric constants I tuned manually. You could also ask Alexander Rhatushnyak, who has tuned the constants in my code to the extreme. I don't have the patience to spend that much time on it.

  20. #20
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Automatic parameter optimization

    Looks like you missed my automatic parameter optimization via a genetic algorithm. Note that quantization mappings, context masks, and the like can also be optimized. Have a look at this: http://encode.ru/forum/showthread.php?t=159
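    For a feel of what such an optimizer does, here is a much-simplified deterministic hill-climb sketch - a stand-in, not toffer's GA: a real genetic algorithm adds a population, random mutation and crossover, and the objective would be an actual compressor run rather than the placeholder function assumed here.

```cpp
#include <cassert>
#include <functional>
#include <initializer_list>
#include <vector>

// Repeatedly try +/-1 mutations of each integer parameter and keep a
// mutant only if the objective (e.g. compressed size in bytes) improves.
std::vector<int> optimize(std::vector<int> params,
                          const std::function<int(const std::vector<int>&)>& objective,
                          int passes) {
    int best = objective(params);
    for (int pass = 0; pass < passes; ++pass) {
        for (size_t i = 0; i < params.size(); ++i) {
            for (int delta : {-1, +1}) {       // small "mutation" of one parameter
                std::vector<int> trial = params;
                trial[i] += delta;
                int score = objective(trial);
                if (score < best) {            // keep only improvements
                    best = score;
                    params = trial;
                }
            }
        }
    }
    return params;
}
```

    The same keep-if-better loop is what the GA does at scale, just over a whole population of parameter vectors at once.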

    BTW: did anybody ever try to integrate the optimizer into a different program?
    Last edited by toffer; 3rd December 2008 at 19:01.

  21. #21
    Programmer osmanturan's Avatar
    Quote Originally Posted by toffer View Post
    BTW: did anybody ever try to integrate the optimizer into a different program?
    AFAIK, only a few people use automated optimization - even simple bit flipping. So I don't expect that anyone has used your optimizer

  22. #22
    Expert
    Matt Mahoney's Avatar
    epm (described at http://cs.fit.edu/~mmahoney/compression/text.html#1749 ) is one of the few compressors I know of that uses automatic parameter optimization.

  23. #23
    Administrator Shelwien's Avatar
    The first archiver with a public parameter optimizer was, in fact, bee
    by Andrew Filinsky. http://compression.ru/fa
    It's been there since at least 1999-2000, and it even has a config with
    different parameter strings per file extension or something.
    http://compression.ru/fa/files2/bee077src.rar
    (0.7.8 doesn't include a compiled beeopt for some reason.)

  24. #24
    Programmer osmanturan's Avatar
    So, it seems only 5 authors use automatic parameter optimization in the compression area
    Last edited by osmanturan; 7th December 2008 at 21:44. Reason: Spelling...

  25. #25
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
    If someone wants, it's possible to continue bee. It's open source (GPLv2) and has a good GUI.

  26. #26
    Member m^2's Avatar
    Quote Originally Posted by lunaris View Post
    and has a good GUI.
    Who cares? Does it have a good command line?
    You don't have to answer - I'm not taking it anyway.

  27. #27
    The Founder encode's Avatar
    Working hard on PPMX v0.03. I've already rewritten the entire PPMX codebase, added many code optimizations, and implemented a new escape strategy, making PPMX both notably faster and stronger.

    Code:
    ENWIK8 -> 22,489,310 bytes
    ENWIK9 -> 193,275,280 bytes


    To be continued...

