
Thread: decompression is ~1000 times slower than compression?

  1. #1
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    231
    Thanks
    36
    Thanked 78 Times in 42 Posts

    decompression is ~1000 times slower than compression?

    Can you think of a compressor with a compression speed a thousand times higher than its decompression speed?
    Do you know of any publications mentioning or discussing such unusual cases?

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    paq on a multi-core CPU. Its algorithm allows all models to be executed independently during compression, but not during decompression.
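    A minimal sketch (not paq's actual code) of why CM *encoding* parallelizes this way: at compression time all input bits are already known, so each model can scan the whole input in its own thread; decoding can't do this, because the context bits a model needs aren't decoded yet. The Model type here is an illustrative stub.
    Code:
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Model {                    // stub: a real model keeps context stats
        float p(const std::vector<int>& bits, std::size_t i) { return 0.5f; }
    };

    int main() {
        std::vector<int> bits(1 << 20, 0);          // the fully known input
        std::vector<Model> models(4);               // e.g. one model per core
        std::vector<std::vector<float>> pred(models.size(),
                                 std::vector<float>(bits.size()));
        std::vector<std::thread> pool;
        for (std::size_t m = 0; m < models.size(); ++m)
            pool.emplace_back([&, m] {              // models run independently
                for (std::size_t i = 0; i < bits.size(); ++i)
                    pred[m][i] = models[m].p(bits, i);
            });
        for (auto& t : pool) t.join();
        // Mixing + arithmetic coding then run sequentially over pred[][],
        // which is cheap compared to the models themselves.
    }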

  3. #3
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    231
    Thanks
    36
    Thanked 78 Times in 42 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    paq on a multi-core CPU. Its algorithm allows all models to be executed independently during compression, but not during decompression.
    Let's consider a 4-core CPU. Should compression be ~4 times faster than decompression, after the implementation is optimized for a 4-core CPU? Will only one core be occupied during decompression?

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,966
    Thanks
    153
    Thanked 803 Times in 400 Posts
    Yeah, it should be possible to make paq 4x faster on a quad-core.
    Comparing versions fully speed-optimized for 1 and 4 cores is another
    matter, though, as paq is not really optimized as it is.

    But anyway, the reverse - decompression being much slower than
    compression - is a bit hard, but possible too.
    For example, we could implement a "normal" BWT compressor
    and a reduced-memory decompressor (like in bwmonstr).

    There are also quite a few other cases where additional
    complexity is normally added to the compressor, but it's possible to make
    the compressor a bit faster and the decompressor much slower.
    I mean stuff like an ABS coder, code-stream (de)interleaving and
    various transforms.

    Such cases might also be common in recompression - for example,
    instead of detecting the deflate settings on the compression side in precomp,
    we could just store the inflated data and a hash of the original block, and
    leave the brute force to the decoder.
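    A minimal sketch of that decoder-side brute force, assuming zlib; the hash choice (crc32) and the parameter set searched are illustrative, not what precomp actually does:
    Code:
    // Given the stored raw (inflated) data and a hash of the original
    // deflate stream, the decoder re-deflates with candidate settings
    // until the hash matches.
    #include <string.h>
    #include <zlib.h>

    static int try_settings(const unsigned char* raw, unsigned long raw_len,
                            unsigned long want_crc, int level, int memLevel,
                            unsigned char* out, unsigned long out_cap,
                            unsigned long* out_len) {
        z_stream s; memset(&s, 0, sizeof s);
        if (deflateInit2(&s, level, Z_DEFLATED, 15, memLevel,
                         Z_DEFAULT_STRATEGY) != Z_OK) return 0;
        s.next_in  = (Bytef*)raw; s.avail_in  = (uInt)raw_len;
        s.next_out = out;         s.avail_out = (uInt)out_cap;
        int done = (deflate(&s, Z_FINISH) == Z_STREAM_END);
        *out_len = s.total_out;
        deflateEnd(&s);
        return done && crc32(0L, out, (uInt)*out_len) == want_crc;
    }

    // Decoder side: cheap for the encoder (it only inflates and hashes),
    // expensive here - up to 9x9 deflate passes per block:
    // for (int level = 1; level <= 9; ++level)
    //     for (int memLevel = 1; memLevel <= 9; ++memLevel)
    //         if (try_settings(raw, raw_len, stored_crc, level, memLevel,
    //                          buf, sizeof buf, &len)) goto found;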

    Anyway, I can think of more examples if necessary, and they might
    even make some sense (usually the encoding speed can be slightly
    improved at the cost of a significant decompression slow-down),
    but it's not quite practical anyway :)
    Last edited by Shelwien; 25th April 2010 at 08:42.

  5. #5
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    Let's consider a 4-core CPU. Should compression be ~4 times faster than decompression, after the implementation is optimized for a 4-core CPU? Will only one core be occupied during decompression?
    There are 1000+ core machines out there, so if we set aside possible scaling issues, a 1000-fold difference is possible.

  6. #6
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    231
    Thanks
    36
    Thanked 78 Times in 42 Posts
    Quote Originally Posted by m^2 View Post
    There are 1000+ core machines out there, so if we set aside possible scaling issues, a 1000-fold difference is possible.
    Do you mean that only one core will be busy during decompression? When you get predictions from the models, why can't you process half of the models on another core? After the bit is successfully decoded, why can't you update half of the models on one core and the other half on another core?

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,966
    Thanks
    153
    Thanked 803 Times in 400 Posts
    Of course, there are some options for parallelism in decoding too, but they are very limited.
    If the file was encoded sequentially, you basically can't decode the next bit
    until you know the previous bits (speculative decoding is possible, but doesn't
    really change anything).
    Sure, it's still possible to run parts on different cores, but with a sync on each bit
    it would likely end up slower than the single-threaded version.

  8. #8
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    231
    Thanks
    36
    Thanked 78 Times in 42 Posts
    Quote Originally Posted by Shelwien View Post
    Sure, it's still possible to run parts on different cores, but with a sync on each bit
    it would likely end up slower than the single-threaded version.
    Why is synchronizing on each bit so bad that a decompressor optimized for (and running on) a 2-core CPU is likely to run slower than a decompressor optimized for (and running on) a single-core CPU with the same frequency, cache sizes, etc.?

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,966
    Thanks
    153
    Thanked 803 Times in 400 Posts
    Well, unfortunately that's how it is.
    The environment on a PC is completely unstable, so it's basically impossible
    to force two threads to do different work in the same number of CPU clocks
    and then exchange their info without waiting.

    The easiest approach would be to compute the probability estimation in one
    thread and do the update in another, so that the update for the previous bit
    and the estimation for the next one can overlap.
    But then we'd have to verify whether the estimation uses any of the update's
    results, and wait for those if it does.
    And then we'd have to ensure that the update finishes before the next one,
    and transmit the decoded bit to the update thread.
    Thus both threads would waste quite a bit of time waiting (e.g. 10-20% easily),
    and we'd need additional checks for context matches.
    Now, this still might improve speed with a complex model like paq8,
    but with faster CMs (like ccm or m1) it just doesn't make sense.
    Code:
    | ctx check/wait | wait for bit  |
    | P()            | --            |
    | bit decoding   | Update()      |
    Alternatively, we can run some of the submodels (both estimation and update)
    in different threads. For paq8 this might actually be better, because
    there are a lot of independent submodels.
    Code:
    | M1.P()          | M2.P()       |
    | wait for M2.P   | wait for bit |
    | mixing          | --           |
    | bit decoding    | --           |
    | M1.Update       | M2.Update    |
    But still, even with paq8 it's unlikely to get better than 1.5x
    with 2 threads, and further scaling would be even worse.
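    A minimal sketch of that second layout, with assumed names (Model, mix and decode_bit are stubs, not real paq8 code); the spin-waits mark exactly the per-bit sync points from the diagram where the time gets wasted:
    Code:
    #include <atomic>
    #include <thread>

    struct Model {                              // stub submodel
        float p()             { return 0.5f; }  // probability estimation
        void  update(int bit) { (void)bit; }    // statistics update
    };
    static float mix(float a, float b) { return (a + b) / 2; } // stub mixer
    static int   decode_bit(float p)   { return p > 0.5f; }    // stub decoder

    Model M1, M2;
    float p2;                          // M2's estimate for the current bit
    int   bit;                         // the decoded bit
    std::atomic<int> p2_ready{-1};     // last bit index M2 has estimated
    std::atomic<int> bit_ready{-1};    // last bit index main has decoded
    std::atomic<int> m2_done{-1};      // last bit index M2 has updated

    void m2_worker(int nbits) {
        for (int i = 0; i < nbits; ++i) {
            p2 = M2.p();                                           // M2.P()
            p2_ready.store(i, std::memory_order_release);
            while (bit_ready.load(std::memory_order_acquire) < i)
                ;                                                  // wait for bit
            M2.update(bit);                                        // M2.Update()
            m2_done.store(i, std::memory_order_release);
        }
    }

    int main() {
        const int nbits = 1 << 20;
        std::thread t(m2_worker, nbits);
        for (int i = 0; i < nbits; ++i) {
            float p1 = M1.p();                                     // M1.P()
            while (p2_ready.load(std::memory_order_acquire) < i)
                ;                                                  // wait for M2.P
            bit = decode_bit(mix(p1, p2));                         // mix + decode
            bit_ready.store(i, std::memory_order_release);
            M1.update(bit);                                        // M1.Update()
            while (m2_done.load(std::memory_order_acquire) < i)
                ;                    // keep rounds in lockstep (protects p2/bit)
        }
        t.join();
    }
    Even in this toy version every decoded bit costs two cross-core round trips, which is where that 10-20% (or worse) goes.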

    Anyway, I'm not saying that CM decoding speed can't be improved
    by multithreading, but compared to encoding the possibilities are
    very limited.
    The encoder has access to all the data at once, so it's much easier
    to run it in parallel with barely any sync at all (blockwise).

  10. #10
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    Well, unfortunately that's how it is.
    The environment on a PC is completely unstable, so it's basically impossible
    to force two threads to do different work in the same number of CPU clocks
    and then exchange their info without waiting.
    There are platforms that do better in this regard.
    Tilera CPUs and their Zero Overhead Linux can guarantee no context switches on particular cores.

    Also, another scaling idea, not related to Tilera:
    First, for each bit of the source, count context statistics over some fixed-length window before it. Having the statistics, run all the models on the bit. Then comes mixing. If you use static weights, the whole algorithm is inherently parallel (though the statistics counting does the same work multiple times in different threads...). If you do something smarter, you might end up with a part that's sequential, but if it's lightweight enough, it might allow huge scaling too.
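    A minimal sketch of the per-bit recounting (illustrative names; one simple counting model stands in for "all the models", and static mixing weights are what make the loop embarrassingly parallel):
    Code:
    #include <cstddef>
    #include <vector>

    // Probability that bits[i] == 1, estimated only from the fixed-length
    // window before position i - the redundant recounting mentioned above.
    float p_from_window(const std::vector<int>& bits, std::size_t i) {
        const std::size_t WINDOW = 4096;
        std::size_t start = i > WINDOW ? i - WINDOW : 0, ones = 0;
        for (std::size_t j = start; j < i; ++j) ones += bits[j];
        return (ones + 1.0f) / (i - start + 2.0f);   // Laplace estimator
    }

    // With static weights there is no sequential state left, so this loop
    // can be split across any number of cores (e.g. one chunk per core).
    void predict(const std::vector<int>& bits, std::vector<float>& pred) {
        pred.assign(bits.size(), 0.0f);
        for (std::size_t i = 0; i < bits.size(); ++i)   // inherently parallel
            pred[i] = p_from_window(bits, i);
    }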
    Last edited by m^2; 25th April 2010 at 20:48.

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    Why is synchronizing on each bit so bad...?
    Have you tried to develop multi-threaded programs? It's common knowledge that you need to perform a rather large amount of work between synchronization points - but maybe your experience differs?

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    2,966
    Thanks
    153
    Thanked 803 Times in 400 Posts
    > There are platforms that do better in this regard.

    Even just having a memory cache is enough to demonstrate
    100x differences in timings of the same code.
    Also, with more threads there are more sync points... and each
    potentially wastes time. So when the time required to complete
    the task allocated to a thread becomes smaller than the time
    wasted on sync, the whole process likely becomes slower than
    the single-threaded version too.

  13. #13
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    231
    Thanks
    36
    Thanked 78 Times in 42 Posts
    Regardless of how many cores the CPU has, if memory is split into fast memory and slow memory (e.g. RAM and HDD), and a random access to fast memory typically costs F cycles while a random access to slow memory costs S cycles, then context-mixing compression on this system can be up to (S+c)/(F+c) times faster than CM decompression.
    This is because you can design the algorithm so that during compression almost all random memory accesses fall into fast memory, while during decompression almost all of them fall into slow memory: in the compressor all models can work in turn (therefore each model can use almost all of the fast memory), but in the decompressor you must run through all the models for each bit of data being decompressed.

    What are the advantages of such a design? For example, if the compressor knows how much fast memory the decompressing side can allocate, it can initialize and run the appropriate number of models; e.g. 2 GB of RAM in the decompressor => a compressor with only 128 MB of RAM can run up to 15 models consecutively, each of these models using ~128 MB of RAM (although decompression on a device with 128 MB of RAM will be either (S+c)/(F+c) times slower or impossible).
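    A worked instance of that bound, with assumed round numbers (and reading c as the fixed per-access computation cost, which the formula implies):
    Code:
    F = 100 cycles     (RAM access on a cache miss)
    S = 10^7 cycles    (one ~10 ms HDD seek at 1 GHz)
    c = 100 cycles     (computation per access)

    (S + c) / (F + c) = (10^7 + 100) / (100 + 100) ~= 50,000x
    which is comfortably beyond the ~1000x asked about in the thread title.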
    Last edited by Alexander Rhatushnyak; 27th April 2010 at 08:10.

  14. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    srep -m1 may be an interesting example, if you need an algorithm that compresses in RAM and decompresses using the HDD.

  15. #15
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Alexander Rhatushnyak View Post
    Can you think of a compressor with a compression speed a thousand times higher than its decompression speed?
    Do you know of any publications mentioning or discussing such unusual cases?
    An algorithm which works with integer prime-factor multiplication, for example: the inverse, integer factorization, is dead slow. There exist satellite-link image compressors which work that way; I don't have the reference at the moment. I think a lot of image compressors which have to compress on very simple machines that are not exchangeable (no taking out a cartridge, no possibility of a cable, and so on) compress faster than they decompress. The destination has all the resources necessary.

  16. #16
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Alexander Rhatushnyak View Post
    Can you think of a compressor with a compression speed a thousand times higher than its decompression speed?
    Do you know of any publications mentioning or discussing such unusual cases?
    Another "common" case would be if you have a complex implicit set of rules and you create a source string for which it's too complex to maintain the related statistics. The sender may instantly produce the next symbol, while the receiver has to apply and exclude all the rules which are violated in order to reach the unique valid set of possible symbols. Trellis codes are an example of this kind of implicitness.

    Another trivial case: when the compressor has a huge amount of memory and the decompressor almost none, you may get 1:1000 ratios.

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 774 Times in 484 Posts
    Quote Originally Posted by Ethatron View Post
    An algorithm which works with integer prime-factor multiplication, for example: the inverse, integer factorization, is dead slow. There exist satellite-link image compressors which work that way; I don't have the reference at the moment.
    I think that would be RSA encryption rather than compression.

  18. #18
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Matt Mahoney View Post
    I think that would be RSA encryption rather than compression.
    The image compressor I referred to may have used error-correcting codes based on primes, as satellite links may suffer from severe signal distortion. Too bad I can't find the paper, and I'm already using full-text searches.
