
Thread: PPM o4 vs ST4

  1. #1
    Gribok (Programmer)

    PPM o4 vs ST4

    Could someone make a fair comparison of PPMX v0.06 and BSC with the ST4 transform? It would be interesting to compare ST4 vs PPM performance in both compression ratio and speed. To make it fair, please use "-b14p -m1" or "-b14p -m1f": this disables all preprocessing and OpenMP support, and makes bsc use the same amount of memory as ppmx.
    Enjoy coding, enjoy life!

  2. #2
    joerg (Member)
    @gribok:

    Excuse me, but why do you want to compare PPMX 006 with BSC -m1 (the "ST4 transform")?

    Why not compare it with -m2, "ST5"?

    In my short test your ST5 mode beat ppmx 006 clearly:
    it works faster and it compresses better in all my tested cases.

    What is special about your ST4 mode?
    To me it seems that the ST5 mode is clearly more powerful.

    In which cases should ST4 be better than ST5?

    best regards

  3. #3
    Shelwien (Administrator)
    http://nishi.dreamhosters.com/u/bsc226.htm
    Also some CMs in the same circumstances:
    http://nishi.dreamhosters.com/u/m1x2.htm

    @joerg: what is "special" about ST4 is that it's the same o4 as PPMX.

    @gribok:
    BWT, and especially ST4, is expected to be faster than PPM.
    However, PPM should win in compression (and it does, e.g. on bookstar).
    In theory, PPM's benefits vs normal BWT (not m03-style) should be visible
    on nonstationary data and/or with memory usage limited to something considerably
    smaller than the size of the input data.

    But, unfortunately, PPMd has a fairly stationary model (except for SEE) and
    blockwise behaviour similar to that of BWT, so usually it's only barely better
    than BWT codecs on text.

    Still, imho that doesn't mean that PPM can't do better or doesn't have any
    benefits over BWT/ST. PPM allows a lot of things which BWT blocks,
    like better adaptivity and format parsing. But unfortunately it requires a lot more
    work than stuff based on hashtables and libdivsufsort.

  4. #4
    Gribok (Programmer)
    Quote Originally Posted by joerg View Post
    @gribok:
    excuse me - but why do you want to compare PPMX 006 with BSC -m1 "ST4 transform" ?
    PPM and ST are different methods, but in this case both of them will use statistics of the same order-4 contexts. The results should be interesting.
    Enjoy coding, enjoy life!

  5. #5
    Matt Mahoney (Expert)
    Well, a simple test with enwik8 (dual-core T3200, 2 GHz, 3 GB RAM, Vista 32-bit):

    24,017,318 bsc-x86 2.2.6 -m1 -b100, 10.9 sec (164% CPU), 500 MB
    25,694,806 ppmd -o4 -m37, 18.3 sec (1 core), 36.3 MB
    23,475,434 ppmonstr -o4, 165.4 sec (1 core), 36.5 MB

    Order-4 test with the Calgary corpus as a tar file:

    3,152,896 calgary.tar
    804,566 calgary.bsc
    804,992 calgary.pmd
    715,447 calgary.pmm

    As separate files:

    27,825 BIB.bsc
    222,355 BOOK1.bsc
    152,325 BOOK2.bsc
    54,567 GEO.bsc
    113,067 NEWS.bsc
    10,774 OBJ1.bsc
    73,063 OBJ2.bsc
    16,255 PAPER1.bsc
    24,417 PAPER2.bsc
    47,060 PIC.bsc
    12,262 PROGC.bsc
    15,283 PROGL.bsc
    10,249 PROGP.bsc
    17,535 TRANS.bsc
    797,037 bytes

    25,443 BIB.pmd
    216,021 BOOK1.pmd
    149,404 BOOK2.pmd
    55,403 GEO.pmd
    109,407 NEWS.pmd
    9,419 OBJ1.pmd
    72,551 OBJ2.pmd
    15,004 PAPER1.pmd
    22,912 PAPER2.pmd
    50,095 PIC.pmd
    11,332 PROGC.pmd
    14,751 PROGL.pmd
    10,193 PROGP.pmd
    16,965 TRANS.pmd
    778,900 bytes

    24,769 BIB.pmm
    213,137 BOOK1.pmm
    142,298 BOOK2.pmm
    48,171 GEO.pmm
    103,670 NEWS.pmm
    8,197 OBJ1.pmm
    57,043 OBJ2.pmm
    14,460 PAPER1.pmm
    22,295 PAPER2.pmm
    32,006 PIC.pmm
    10,619 PROGC.pmm
    13,348 PROGL.pmm
    9,070 PROGP.pmm
    15,334 TRANS.pmm
    714,417 bytes
    Last edited by Matt Mahoney; 29th July 2010 at 05:33. Reason: ppmd -b37 -> -m37

  6. #6
    Shelwien (Administrator)
    @Matt:
    Any comments regarding the choice of coders and their options? :)
    To me they don't seem comparable at all.
    Btw, ppmonstr -o4 isn't really o4 - there's unary SSE and a sparse submodel added on top of that.

  7. #7
    Matt Mahoney (Expert)
    ppmd is certainly the most popular PPM compressor. ppmonstr is stable and has very good compression. I selected maximum memory options in all cases. (Typo: ppmd -b37 should say -m37).

    Here are some results for ppmx 0.06

    26,131,726 enwik8.ppmx, 28.3 sec (1 core), 71 MB

    816,877 calgary.tar.ppmx

    26,167 BIB.ppmx
    221,728 BOOK1.ppmx
    153,109 BOOK2.ppmx
    58,976 GEO.ppmx
    112,839 NEWS.ppmx
    9,900 OBJ1.ppmx
    73,980 OBJ2.ppmx
    15,463 PAPER1.ppmx
    23,696 PAPER2.ppmx
    51,478 PIC.ppmx
    11,697 PROGC.ppmx
    15,143 PROGL.ppmx
    10,478 PROGP.ppmx
    17,391 TRANS.ppmx
    802,045 bytes

  8. #8
    Matt Mahoney (Expert)
    ppmx 0.05

    22,905,422 enwik8.ppmx, 95.1 sec, 576 MB

    775,647 calgary.tar.ppmx

    24,782 BIB.ppmx
    219,426 BOOK1.ppmx
    145,995 BOOK2.ppmx
    58,508 GEO.ppmx
    107,509 NEWS.ppmx
    9,825 OBJ1.ppmx
    69,117 OBJ2.ppmx
    15,009 PAPER1.ppmx
    23,104 PAPER2.ppmx
    50,547 PIC.ppmx
    11,327 PROGC.ppmx
    13,463 PROGL.ppmx
    9,470 PROGP.ppmx
    14,945 TRANS.ppmx
    773,027 bytes

  9. #9
    Sanmayce (Member)
    Quote Originally Posted by Matt Mahoney View Post
    ppmd is certainly the most popular PPM compressor. ppmonstr is stable and has very good compression. I selected maximum memory options in all cases.
    I am far from arguing with an expert, but my opinion as a user (I am mainly interested in an English-text [de]compressor) is that bsc is way better than PPMONSTR. Or am I wrong!?

    Regards

  10. #10
    encode (The Founder)

    I guess it should be noted that PPMX/PPMd and BSC have different symbol coding: PPMd, and especially PPMX, has a very basic byte-wise, count-based model, while BSC has a bit-wise CM with a few advanced things such as SSE... So ST4 is probably better compared to a simple order-4 CM compressor. Well, this simplest and fastest CM will be released by me in a few days - I will just transform PPMX into a CM (FCM)...
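
    Roughly, by a "simple order-4 CM" I mean something along these lines (purely an illustrative sketch with made-up names and constants - not the actual FCM/PPMX code): a hashed order-4 context selects a bit probability, which feeds a binary range coder and then gets updated.

    Code:
     // Illustrative sketch only - not the actual FCM/PPMX code.
     #include <cstdint>
     #include <vector>

     struct O4Model {
         std::vector<uint16_t> p;          // P(bit==0), 12-bit fixed point
         O4Model() : p(1 << 22, 2048) {}   // 4M counters, all initialized to 1/2

         // hash of the last 4 bytes plus the bits already coded in the current byte
         static uint32_t hash(uint32_t ctx4, uint32_t partial) {
             return (ctx4 * 2654435761u) ^ (partial * 0x9E3779B1u);
         }

         uint16_t& prob(uint32_t ctx4, uint32_t partial) {
             return p[hash(ctx4, partial) & (p.size() - 1)];
         }

         static void update(uint16_t &pr, int bit) {   // simple adaptive update
             if (bit == 0) pr += (4096 - pr) >> 5;
             else          pr -=  pr         >> 5;
         }
     };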

  11. #11
    Shelwien (Administrator)
    In theory, the bytewise rangecoding is one of PPM's advantages.
    Divisions are fast these days and allow for decoding of complex unary codes with a single rc lookup.
    Could be useful for bsc.
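
    For illustration (a generic sketch, not bsc's or ppmd's actual code): with a frequency-table model, one division in the decoder is enough to locate the symbol's subrange, instead of one coder call per bit.

    Code:
     // Generic sketch of decoding a whole symbol with a single division.
     // freq[] holds the per-symbol frequencies of the current context, totFreq their sum.
     // Carry handling and byte input are omitted.
     #include <cstdint>

     struct Rc { uint32_t low, range, code; };     // simplified carryless rc state

     int DecodeSymbol(Rc &rc, const uint32_t *freq, uint32_t totFreq) {
         rc.range /= totFreq;
         uint32_t f = (rc.code - rc.low) / rc.range;   // the single division
         if (f >= totFreq) f = totFreq - 1;            // guard against truncation

         uint32_t cum = 0;                             // find which symbol owns 'f'
         int s = 0;
         while (cum + freq[s] <= f) cum += freq[s++];

         rc.low   += cum * rc.range;                   // narrow to that symbol's subrange
         rc.range *= freq[s];
         while (rc.range < (1u << 24)) {               // renormalize (input bytes omitted)
             rc.code <<= 8; rc.low <<= 8; rc.range <<= 8;
         }
         return s;
     }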

  12. #12
    encode (The Founder)
    ...but symbol probability computation is primitive...

  13. #13
    Shelwien (Administrator)
    It doesn't have to be like that.
    Basically, the idea is to do rc renormalization only once per some code tree, instead of per bit.
    Sure, it can be tricky to avoid overflowing rc precision, but it's likely worth it.
    For example, AFAIR fpaq0pv4B does rc renorm only once per two bits, or something like that.
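
    For instance, something like this (just a sketch under the assumptions that the range is >= 1<<24 right after renormalization and probabilities are 12-bit; the function names are made up and this is not fpaq0pv4B's actual code):

    Code:
     // Sketch only: coding two bits per renormalization.
     // p0 in [1..4095] is the 12-bit probability of the bit being 0.
     // Carry handling and byte output are omitted.
     #include <cstdint>

     struct Rc { uint32_t low = 0, range = 0xFFFFFFFFu; };

     static void EncodeBitNoRenorm(Rc &rc, int bit, uint32_t p0) {
         uint32_t r0 = (rc.range >> 12) * p0;    // subrange for bit == 0
         if (bit == 0) rc.range = r0;
         else        { rc.low += r0; rc.range -= r0; }
     }

     static void Renormalize(Rc &rc) {
         while (rc.range < (1u << 24)) {          // shift out the top bytes
             rc.low <<= 8; rc.range <<= 8;
         }
     }

     void EncodeTwoBits(Rc &rc, int b0, uint32_t p0, int b1, uint32_t p1) {
         EncodeBitNoRenorm(rc, b0, p0);   // worst case leaves range >= 1<<12
         EncodeBitNoRenorm(rc, b1, p1);   // worst case leaves range >= 1, never 0
         Renormalize(rc);                 // one renorm covers both bits
     }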

  14. #14
    encode (The Founder)
    No, I meant that bytewise counters are more restricted; with bit probabilities we've got freedom - easy mixing (including double counters), SSE, etc. We could also easily do a 64-bit new-range calculation... (Is a bitwise arithmetic coder more precise/efficient?)
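
    As a toy illustration of that freedom (a sketch only, not taken from any of the coders discussed here): two adaptive bit probabilities from different contexts can simply be mixed for the same coded bit, which a frequency-table model can't do as directly.

    Code:
     // Toy sketch: two adaptive bit probabilities mixed for one coded bit.
     // The constants and the static 50/50 mix are illustrative; real CMs use
     // adaptive mixing weights and SSE stages on top of this.
     #include <cstdint>

     struct BitCounter {
         uint16_t p0 = 2048;                    // P(bit==0), 12-bit fixed point
         void update(int bit, int rate = 5) {   // exponential-decay update
             if (bit == 0) p0 += (4096 - p0) >> rate;
             else          p0 -=  p0         >> rate;
         }
     };

     inline uint32_t mix(const BitCounter &a, const BitCounter &b) {
         return (uint32_t(a.p0) + b.p0) >> 1;   // crude static mix of the two estimates
     }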

  15. #15
    Gribok (Programmer)
    Quote Originally Posted by Shelwien View Post
    In theory, the bytewise rangecoding is one of PPM's advantages.
    Divisions are fast these days and allow for decoding of complex unary codes with a single rc lookup.
    Could be useful for bsc.
    Could you please give me more details?
    Enjoy coding, enjoy life!

  16. #16
    Shelwien (Administrator)
    The speed of multiplications and divisions depends on the size of the arguments, so it's likely not a very good idea even on x64.
    And on x86 a 64-bit multiplication would be slow as hell... some compilers (e.g. MSC) might even generate a function call for it.
    Also, the symbol counters and the rc interface are not really that strongly connected.
    A bitwise rc just does renormalization more frequently, so in some cases it can really be more precise, but it's nothing significant.

  17. #17
    Shelwien (Administrator)
    > Could you please give me more details?

    For example, you can look at the ppmd source, e.g.
    http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar /ppmd_proc1.inc
    As you can see there, that function encodes one of the symbols in a context (up to 256 are possible),
    but only calls the rc once (well, twice on decoding).

    In other words, there may be a cheaper way to encode your stuff than the usual bitwise range coding.
    For example, if you know that rc.range>=(1<<24) (it just got normalized) and you work with 12-bit
    probabilities, then it's certain that you can encode 2 bits without range renormalization, with a guarantee
    that the range won't turn into 0.
    In the same way, if you carefully set up the probability scales within some more complicated coding function,
    it may be possible to do renormalization only once per symbol.

    This approach is also natural for bytewise frequency-based coders (like ppmd), though there it may be
    more troublesome to make use of. Still, it's worth remembering that encoding the choice of
    a symbol from an alphabet of size N>2, when all symbol probabilities are known in advance,
    only requires a single rc call. Additionally, it can be used to encode uncompressed numbers from
    a given range (it's a symmetric encode/decode function):

    Code:
     template <class INT>
     void NProcess( INT &n, uint lo, uint hi ) { // codes a raw value n, with lo<=n<hi
       int r = hi-lo;                            // size of the value range
       n -= lo;
       if( DECODE ) n = GetFreq(r);              // decoder: one division recovers the value
       Process( n, 1, r );                       // code the subrange [n..n+1) of [0..r)
       n += lo;
     }
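
    (Process/GetFreq above are the ppmd-style frequency rangecoder primitives; their bodies aren't shown in the snippet. Roughly, they look like the following - a simplified sketch, not the actual ppmd_sh code, with carry handling and byte i/o omitted - where the symmetric Process() wrapper presumably dispatches to the encode or decode path depending on the DECODE flag:)

    Code:
     #include <cstdint>

     struct Rc { uint32_t low, range, code; };

     // Encode path: narrow to the subrange [cumFreq, cumFreq+freq) of [0, totFreq).
     void Encode(Rc &rc, uint32_t cumFreq, uint32_t freq, uint32_t totFreq) {
         rc.range /= totFreq;
         rc.low   += cumFreq * rc.range;
         rc.range *= freq;
         while (rc.range < (1u << 24)) {        // renormalize: shift out the top byte
             rc.low <<= 8; rc.range <<= 8;      // (writing the output byte is omitted)
         }
     }

     // Decode path, step 1: one division recovers the cumulative frequency...
     uint32_t GetFreq(Rc &rc, uint32_t totFreq) {
         rc.range /= totFreq;                   // range stays divided for Decode()
         uint32_t f = (rc.code - rc.low) / rc.range;
         return f < totFreq ? f : totFreq - 1;  // guard against truncation
     }

     // Decode path, step 2: the same narrowing as Encode(), reusing the divided range.
     void Decode(Rc &rc, uint32_t cumFreq, uint32_t freq) {
         rc.low   += cumFreq * rc.range;
         rc.range *= freq;
         while (rc.range < (1u << 24)) {        // renormalize: shift in the next input byte
             rc.code <<= 8; rc.low <<= 8; rc.range <<= 8;
         }
     }
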
    Also, consider checking out http://ctxmodel.net/files/fpaq0pv4b3.rar
    for some other rc-related optimizations, e.g. 16-bit i/o and a single renorm branch
    in rc_sh2d.inc, and rc state inlining (keeping the rc state vars as separate local
    variables instead of a strict structure) in the -tpl versions.

  18. #18
    Sanmayce (Member)
    I wish I could understand what you are talking about, but I can only provide some tests & thoughts.
    I don't want to interrupt your discussion, so if you are interested in the testing requested in the first post, here it is.

