
Thread: lpaq1 is here

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,604
    Thanks
    125
    Thanked 364 Times in 235 Posts
    I wrote a "lite" version of paq8l called lpaq1, 35 times faster but files are 10% bigger (but still #8 on enwik9). It is a file compressor, not an archiver. http://cs.fit.edu/~mmahoney/compression/lpaq1.zip

    I wanted something reasonably fast and simple and open source (GPL). It does not have any filters or specialized models (except 1 for text). It mixes only 7 contexts for speed (orders 1, 2, 3, 4, 6, a unigram word context, and a match model). It uses a neural mixer and 2 SSE stages. Source documentation explains how it works.
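The neural mixing Matt describes can be sketched in floating point (lpaq1 itself uses fixed-point arithmetic and a different learning rate schedule; the names below are illustrative, not lpaq1's actual code):

```cpp
#include <cassert>
#include <cmath>

// Sketch of logistic mixing as used in context-mixing compressors:
// each model's bit probability is stretched into the logit domain,
// combined with learned weights, and squashed back to a probability.
double stretch(double p) { return std::log(p / (1 - p)); }   // logit
double squash(double x)  { return 1 / (1 + std::exp(-x)); }  // inverse

struct Mixer {
    static const int N = 7;   // 7 contexts, as in lpaq1
    double w[N] = {0};        // mixing weights, adapted online
    double st[N];             // stretched inputs, saved for the update

    double predict(const double p[N]) {
        double dot = 0;
        for (int i = 0; i < N; ++i) {
            st[i] = stretch(p[i]);
            dot += w[i] * st[i];
        }
        return squash(dot);
    }
    // After coding each bit, move weights down the gradient of coding cost.
    void update(double p, int bit, double lr = 0.01) {
        double err = bit - p;
        for (int i = 0; i < N; ++i) w[i] += lr * err * st[i];
    }
};
```

With zero initial weights the mixer outputs 0.5; training on a stream of 1 bits with inputs above 0.5 pushes its prediction toward 1.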

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    3,587
    Thanks
    150
    Thanked 192 Times in 106 Posts
now, that's something for real use!!! i think CCM would be its closest competitor

    Matt, is it possible to divide PAQ/LPAQ work among several CPU cores?

  3. #3
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    74
    Thanks
    2
    Thanked 1 Time in 1 Post
    Bump. I think lpaq needs more hype here.

    I agree completely with Bulat's post and also would like to see an answer to his question.
    I think Matt needs a medal or something

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,604
    Thanks
    125
    Thanked 364 Times in 235 Posts
    Yeah, CCM is faster but it's not open source. I don't know how it does it. I put some effort into optimizing lpaq1 for speed. I'm not sure what else I could change that wouldn't hurt compression.

    In theory a CM compressor could be spread across multiple cores by running the models in separate threads. This would be easier to do in a compressor with lots of models like paq8l. The models would need separate hash tables (currently they are shared in lpaq1) and need to be synchronized for every bit prediction. The steps of combining the prediction, the SSE stages and arithmetic coding are serial and have to be done in one core. So I don't think there would be a big speedup.

    The other approach would be to split the input into separately compressed blocks like parallel bzip2. This would hurt compression because you could not use redundancy between blocks.
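The block-splitting structure Matt mentions can be sketched with standard threads. Run-length encoding stands in for the real CM model here, and `compress_parallel` and the block size are illustrative, not part of any real tool; the point is only that each block is coded independently, so cross-block redundancy is lost:

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Toy stand-in compressor: run-length encode one block.
std::string rle(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ) {
        size_t j = i;
        while (j < in.size() && in[j] == in[i] && j - i < 255) ++j;
        out += char(j - i);   // run length
        out += in[i];         // run byte
        i = j;
    }
    return out;
}

// Split the input into fixed-size blocks and compress each on its own
// thread, like parallel bzip2. Blocks share no state, so they scale
// across cores but cannot exploit redundancy between blocks.
std::vector<std::string> compress_parallel(const std::string& in, size_t bs) {
    size_t nblocks = (in.size() + bs - 1) / bs;
    std::vector<std::string> out(nblocks);
    std::vector<std::thread> pool;
    for (size_t b = 0; b < nblocks; ++b)
        pool.emplace_back([&, b] { out[b] = rle(in.substr(b * bs, bs)); });
    for (auto& t : pool) t.join();
    return out;   // one independently decodable output per block
}
```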

    It will be interesting if someone comes out with a faster version.

  5. #5
    Member
    Join Date
    Feb 2009
    Location
    USA
    Posts
    58
    Thanks
    0
    Thanked 0 Times in 0 Posts
Incredible program, Matt. I too am curious about the secret of CCM's speed. According to Christian, a lot of the speed comes from "cache tricks". Let's hope he decides to go open source

It'd be interesting to see the speed difference if you compiled it with the Intel compiler.

    Here's some comparison between CCM and lpaq on my AMD Athlon 3000+.

    ccm c enwik8 enwik8.ccm
    22,318,446 Bytes (148 MB Memory) 1min 39sec

    ccmx c enwik8 enwik8.ccm
    21,601,446 Bytes (148 MB Memory) 2min 16sec

    ccmx c 7 enwik8 enwik8.ccm
    20,819,656 Bytes 2min 19sec

    lpaq 5 enwik8 enwik8.lpaq
20,389,506 Bytes (99 MB Memory) 5min 7sec

    lpaq 6 enwik8 enwik8.lpaq
    20,078,550 Bytes (195 MB Memory) 5min 22sec

    lpaq 9 enwik8 enwik8.lpaq
    19,755,948 Bytes (1539 MB Memory) 6min 10sec

  6. #6
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,342
    Thanks
    88
    Thanked 55 Times in 34 Posts
The Return of Matt Mahoney! Fantastic job!

  7. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    3,587
    Thanks
    150
    Thanked 192 Times in 106 Posts
    actually, in my test lpaq shows itself as a close competitor to ppmonstr rather than ccm. just a few numbers:

    3.800 kb - ppmd
    3.500 kb - ccm
    3.200 kb - ppmonstr and lpaq

    lpaq is also a bit slower than ppmonstr, but it is the only *open-source* program for the moment falling into this speed/compression niche

Matt Mahoney:
    yes, we all know that input may be divided into independent blocks

about dividing models between cores - is it possible to run several models simultaneously, so that each model converts the input stream into a stream of bits predicted by that particular model, and then one more thread combines their predictions? this would allow asynchronous execution of the models

do you know how much time is spent in each part of lpaq?

  8. #8
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks Matt!

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    2,604
    Thanks
    125
    Thanked 364 Times in 235 Posts
    Yes, models can be divided among threads. Most of the time is spent in the higher order models due to cache misses. This problem does not go away using multiple cores. They still are limited by slow memory. You can see the effect by timing with different memory options.

I designed the lpaq1 hash table to minimize cache misses. There are 2 misses per byte per model (5 models if order-1 fits in cache). One trick I used in paq8l but not lpaq1: when a context is seen for the first time, just store the next byte and defer the bit history updates until the next occurrence. This saves one cache miss if the context is never seen again.
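The deferred-update trick can be sketched as follows. lpaq1/paq8l use fixed-size hash tables, not `std::unordered_map`, and `Slot`/`Model` are hypothetical names; this only illustrates the control flow that saves a cache miss for contexts that never recur:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// On the first occurrence of a context, store only the next byte.
// Build the full bit histories only if the context is seen again.
struct Slot {
    bool fresh;         // true: only next_byte is valid so far
    uint8_t next_byte;  // byte that followed the context's first sighting
    uint32_t histories; // stand-in for the real bit-history states
};

struct Model {
    std::unordered_map<uint32_t, Slot> table;
    int deferred = 0, expanded = 0;  // counters for illustration

    void observe(uint32_t ctx, uint8_t byte) {
        auto it = table.find(ctx);
        if (it == table.end()) {           // first sighting: defer the work
            table[ctx] = Slot{true, byte, 0};
            ++deferred;
        } else if (it->second.fresh) {     // second sighting: expand
            it->second.fresh = false;
            // ...replay the stored next_byte into real bit histories here...
            ++expanded;
        }
        // else: normal bit-history update (omitted)
    }
};
```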

Some other optimizations: I could probably reduce the number of hash computations. (Note that g++ optimizes a<<b | a>>(32-b) to a single rotate instruction.) Also, c0 and bcount are computed twice (in Predictor and MatchModel); I could fix this by combining the code or using global variables. But global/static variables are slower to access (longer instruction encodings), e.g. in paq8l mix2() I copied globals into local variables to improve speed. I could also unroll the loops in Mixer.
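The rotate idiom mentioned above looks like this (in C++20 it became `std::rotl`); the surrounding context hash is a hypothetical example with illustrative constants, not lpaq1's actual hash:

```cpp
#include <cassert>
#include <cstdint>

// g++ recognizes this pattern and emits a single ROL instruction.
// Valid for 0 < b < 32 (b == 0 would shift by 32, undefined behavior).
inline uint32_t rotl32(uint32_t a, int b) {
    return a << b | a >> (32 - b);
}

// Hypothetical use in a rolling order-N context hash: mix in one byte,
// then rotate so old bytes drift out of the low bits.
inline uint32_t hash_byte(uint32_t h, uint8_t c) {
    return rotl32(h * 0x9E3779B1u + c + 1, 9);  // constants are illustrative
}
```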

    Speed is similar to ppmonstr or durilca, but compression is not quite as good, so I think there is room for improvement.

  10. #10
    Member
    Join Date
    Feb 2009
    Location
    USA
    Posts
    58
    Thanks
    0
    Thanked 0 Times in 0 Posts
For kicks I compiled lpaq1 with the Intel compiler and saw about an 8% speed increase on average (and my CPU's an AMD). It might be interesting to compare the assembly produced by gcc and icc and see what causes the speed gain.

  11. #11
    Programmer giorgiotani's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    163
    Thanks
    0
    Thanked 0 Times in 0 Posts
    That's a very interesting implementation!

