
Thread: CUDA LZSS

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts

    CUDA LZSS

    I've just stumbled upon a rather interesting project: http://homes.soic.indiana.edu/aozsoy/compression.html
    Unfortunately I don't have a CUDA-enabled card, and I cannot access the GitHub repos (well, I think they're hosted on a private GitHub Enterprise instance).

    Can somebody assess whether their results are impressive or not? Bulat?

    They also talk about decompression on the GPGPU, which makes it even more interesting.
    Last edited by Piotr Tarsa; 23rd July 2014 at 15:44.

  2. #2
    Member
    Join Date
    Jun 2008
    Location
    Germany
    Posts
    369
    Thanks
    5
    Thanked 6 Times in 4 Posts
    Would be nice if that worked with OpenCL, so that ATI/AMD users were not excluded.

  3. #3
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    I'm really skeptical about CUDA-enabled LZSS. However, can somebody provide me with that CUDA-LZSS executable so I can test it on my GTX 780 OC?

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Skeptical about what? From what I understood, they are doing LZSS with 128-, 256- or 512-byte sliding windows, and the entire input is divided into chunks so that each CUDA thread has its own chunk. So it definitely can work.
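    To make it concrete, here's roughly how I picture each thread's part of the work. This is only my own sketch of the chunked scheme, not their code: the chunk size, token layout and all names are made up, and flags are emitted as whole bytes instead of being packed into a bit mask.

    Code:
    // Sketch: each CUDA thread compresses one independent chunk with a small
    // sliding window, so no thread ever needs data from another chunk.
    #include <cuda_runtime.h>
    #include <stdint.h>

    #define CHUNK_SIZE 4096u   // input bytes per thread (my assumption)
    #define WINDOW      512u   // sliding window size, as in the paper
    #define MIN_MATCH     3u
    #define MAX_MATCH   255u

    __global__ void lzss_compress_chunks(const uint8_t *in, size_t in_size,
                                         uint8_t *out,        // 2 * CHUNK_SIZE bytes per chunk
                                         uint32_t *out_sizes) // compressed size of each chunk
    {
        size_t chunk = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        size_t begin = chunk * CHUNK_SIZE;
        if (begin >= in_size) return;
        size_t end = begin + CHUNK_SIZE;
        if (end > in_size) end = in_size;

        uint8_t *dst = out + chunk * 2u * CHUNK_SIZE;
        uint32_t o = 0;

        for (size_t i = begin; i < end; ) {
            // Naive O(WINDOW * length) search for the longest match behind i,
            // restricted to this chunk so chunks stay independent.
            uint32_t best_len = 0, best_off = 0;
            size_t win_start = (i >= begin + WINDOW) ? i - WINDOW : begin;
            for (size_t j = win_start; j < i; ++j) {
                uint32_t len = 0;
                while (len < MAX_MATCH && i + len < end && in[j + len] == in[i + len])
                    ++len;
                if (len > best_len) { best_len = len; best_off = (uint32_t)(i - j); }
            }
            if (best_len >= MIN_MATCH) {
                dst[o++] = 1;                          // match token
                dst[o++] = (uint8_t)(best_off & 0xFF); // offset, low byte
                dst[o++] = (uint8_t)(best_off >> 8);   // offset, high byte (window is 512)
                dst[o++] = (uint8_t)best_len;          // match length
                i += best_len;
            } else {
                dst[o++] = 0;                          // literal token
                dst[o++] = in[i++];
            }
        }
        out_sizes[chunk] = o; // gathered on the host to concatenate the chunks
    }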

    The problem is that it seems to have poor performance. On page 177 (actually the 8th page) of http://homes.soic.indiana.edu/aozsoy...CS-journal.pdf there's a table with performance summaries. CULZSS loses badly to GZIP when it comes to compression ratio. I wonder how CULZSS would compare to LZ4 on its fastest setting. I think LZ4 would be both much faster and much stronger. And it's somewhat suspicious that they're giving speeds in Mbps instead of MB/s.

  5. #5
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Skeptical about the 512-byte window and some other aspects. Even jurassic LZSS implementations use a 2..4 KB window. These days, 8..16 KB is the essential minimum in my opinion. And of course I'm really confident in a 32..128 KB window.
    I believe CUDA is for BWT or tricky CM - computationally intensive stuff.
    Well, since LZSS is bottlenecked mostly by memory access, the fast GDDR5 of a GPU looks pretty attractive to me.
    I hope Intel will add 128 MB of L4 cache to its Broadwell CPUs; according to the specs, the L4 can be accessed not only by the iGPU, but by the CPU as well. Such a cache may hugely improve the compression speed of existing compressors!

  6. #6
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Parallel processing seems to work better when data flows are predictable. Lossless compression is obligated to be opportunistic and unpredictable. I could see it being efficient to produce some kind of 1:1 transform. I'm not sure if the BWT is a good example.

    It probably shouldn't take as much memory to compute the BWT as it does now. It becomes compressible as you are sorting, so there should be a way to compress and sort at the same time (and still be fast).
    Last edited by nburns; 24th July 2014 at 08:18.

  7. #7
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Parallel processing seems to work better when data flows are predictable. Lossless compression is obligated to be opportunistic and unpredictable. I could see it being efficient to produce some kind of 1:1 transform. I'm not sure if the BWT is a good example.

    It probably shouldn't take as much memory to compute the BWT as it does now. It becomes compressible as you are sorting, so there should be a way to compress and sort at the same time (and still be fast).
    Why the 1:1 limitation on the transformation?

  8. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I wonder why they say Huffman coding isn't well parallelizable. First you do a prefix sum on the code lengths, and then you have both offsets and codes, so the problem becomes embarrassingly parallel - at least for encoding.

    For entropy coding to be parallelized on decoding, you need to somehow find some codeword boundaries before the actual decoding so you can start decoding from them. You can store the boundaries in a helper vector alongside the compressed stream, but that negates some or all of the space savings. There are, however, some self-synchronizing codes, e.g. the Fibonacci code.
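    To illustrate the encoding side, here's a minimal sketch of the idea. It's my own code with made-up names (nothing from their paper): it assumes the canonical code table is already built and copied to constant memory, uses Thrust for the scan, and ORs each code into the output with atomics because a code can straddle a 32-bit word boundary.

    Code:
    // Sketch: parallel Huffman encoding via a prefix sum over code lengths.
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/scan.h>
    #include <stdint.h>

    __constant__ uint32_t d_code[256]; // right-aligned canonical codes (uploaded by caller)
    __constant__ uint8_t  d_len[256];  // code lengths in bits

    struct SymbolToLen {
        __device__ uint64_t operator()(uint8_t s) const { return d_len[s]; }
    };

    // One thread per symbol: write the code at its precomputed bit offset.
    __global__ void emit_codes(const uint8_t *syms, const uint64_t *bit_off,
                               size_t n, uint32_t *out_words)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint32_t code = d_code[syms[i]];
        uint32_t len  = d_len[syms[i]];
        uint64_t pos  = bit_off[i];
        uint32_t word = (uint32_t)(pos >> 5), bit = (uint32_t)(pos & 31);
        uint64_t shifted = (uint64_t)code << bit; // LSB-first packing to keep the sketch short
        atomicOr(&out_words[word], (uint32_t)shifted);
        if (bit + len > 32) // code straddles a word boundary
            atomicOr(&out_words[word + 1], (uint32_t)(shifted >> 32));
    }

    // Host side: symbol -> length (LUT), exclusive scan -> per-symbol bit offsets.
    // out_words must be zero-initialized and sized to hold the total bit count.
    void huffman_encode(const thrust::device_vector<uint8_t> &syms,
                        thrust::device_vector<uint32_t> &out_words)
    {
        size_t n = syms.size();
        thrust::device_vector<uint64_t> bit_off(n);
        thrust::transform(syms.begin(), syms.end(), bit_off.begin(), SymbolToLen());
        thrust::exclusive_scan(bit_off.begin(), bit_off.end(), bit_off.begin());
        emit_codes<<<(unsigned)((n + 255) / 256), 256>>>(
            thrust::raw_pointer_cast(syms.data()),
            thrust::raw_pointer_cast(bit_off.data()),
            n, thrust::raw_pointer_cast(out_words.data()));
    }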

  9. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I wonder why they say Huffman coding isn't well parallelizable. First you do a prefix sum on the code lengths, and then you have both offsets and codes, so the problem becomes embarrassingly parallel - at least for encoding.
    How would you get the prefix sums without actually translating the symbols? By then, you'd have already done most of the work.

    However, if you encoded blocks without respect to offsets, the most you'd have to do is shift by a few bits and copy the block into place. But, Huffman isn't exactly compute-intensive to begin with, so maybe the copying is most of the work.
    Last edited by nburns; 24th July 2014 at 21:51.

  10. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Why the 1:1 limitation on the transformation?
    I think you'd want the length of the output to be predictable based on only the length of the input, not its contents. If processor #2 needs too much information from processor #1, the parallelism is lost. 1:1 wouldn't be an absolute requirement for parallelism in general. For lossless compression, the transform could not guarantee to take any less than equal space.

  11. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    What do you mean by "translating"? Assuming you have a table which maps a symbol to its code length, the mapping itself is trivial. After that mapping you do the prefix sum.

    Quote Originally Posted by nburns View Post
    However, if you encoded blocks without respect to offsets, the most you'd have to do is shift by a few bits and copy the block into place. But, Huffman isn't exactly compute-intensive to begin with, so maybe the copying is most of the work.
    Even if you first split into blocks, you still need to compute a prefix sum on the block sizes.


    Most of the work in Huffman encoding is, IMO, computing the actual codes (not only the code lengths), shifting them and inserting them into the bitstream.

  12. #12
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    What do you mean by "translating"? Assuming you have a table which maps a symbol to its code length, the mapping itself is trivial. After that mapping you do the prefix sum.


    Even if you first split into blocks, you still need to compute a prefix sum on the block sizes.
    You don't just need to map symbols to code lengths, you also have to know exactly which symbols are in a particular prefix. By the time you add that up, maybe you've already blown the speedup?


    Quote Originally Posted by Piotr Tarsa View Post
    Most of the work in Huffman encoding is, IMO, computing the actual codes (not only the code lengths), shifting them and inserting them into the bitstream.
    Isn't that all of the work? What else is there?

  13. #13
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I think you'd want the length of the output to be predictable based on only the length of the input, not its contents. If processor #2 needs too much information from processor #1, the parallelism is lost. 1:1 wouldn't be an absolute requirement for parallelism in general. For lossless compression, the transform could not guarantee to take any less than equal space.
    Yes, but if you want the length of the output to be predictable based only on the length of the input, how can compression be performed in parallel with sorting, given that it is not always possible to predict an output that is shorter than the input?

    I think Piotr's point is that the computation of the sum of the code lengths for a block of data takes much less time than generating the bitstream. If this is done in advance, the bitstreams can be processed in parallel, which would save more time than the pre-calculation of the output size of each block costs. Piotr didn't mention it, but the sizes of the blocks could also be calculated in parallel.
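    Something like this block-level plan is what I have in mind. It's just a sketch with invented names; the second pass that actually writes each block's bitstream at its computed offset is left out.

    Code:
    // Sketch: pass 1 sums the code lengths of each block in parallel, a scan
    // turns the sums into output bit offsets, and pass 2 (not shown) encodes
    // every block independently into its own slot.
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <stdint.h>

    __constant__ uint8_t d_len[256]; // code lengths in bits

    // One CUDA block per data block; launch with 256 threads per block.
    __global__ void block_bit_sizes(const uint8_t *in, size_t n, size_t block_bytes,
                                    uint64_t *bits_per_block)
    {
        __shared__ uint64_t partial[256];
        size_t begin = (size_t)blockIdx.x * block_bytes;
        size_t end   = begin + block_bytes;
        if (end > n) end = n;

        uint64_t sum = 0;
        for (size_t i = begin + threadIdx.x; i < end; i += blockDim.x)
            sum += d_len[in[i]];
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction of the per-thread sums (blockDim.x is a power of two).
        for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) bits_per_block[blockIdx.x] = partial[0];
    }

    // Host: per-block bit counts -> per-block output offsets.
    void plan_block_offsets(thrust::device_vector<uint64_t> &bits_per_block,
                            thrust::device_vector<uint64_t> &bit_offsets)
    {
        thrust::exclusive_scan(bits_per_block.begin(), bits_per_block.end(),
                               bit_offsets.begin());
    }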
    Last edited by Kennon Conrad; 25th July 2014 at 05:06.

  14. #14
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, but if you want the length of the output to be predictable based only on the length of the input, how can compression be performed in parallel with sorting, given that it is not always possible to predict an output that is shorter than the input?
    I meant that compressing while sorting could make it faster on a regular CPU. You don't have to resort to parallel processing unless you're out of ideas for major improvements serially.

    I think Piotr's point is that the computation of the sum of the code lengths for a block of data takes much less time than generating the bitstream.
    Does it?

    If this is done in advance, the bitstreams can be processed in parallel, which would save more time than the pre-calculation of the output size of each block costs. Piotr didn't mention it, but the sizes of the blocks could also be calculated in parallel.

  15. #15
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    A prefix sum on code lengths shouldn't be much slower than e.g. a prefix sum on the symbol values (i.e. the verbatim input). After all, it's just one LUT access, and the whole LUT can reside in fast caches.

    Prefix sum algorithms are now highly tuned and known to run crazy fast on GPUs. Look:
    - http://http.developer.nvidia.com/GPU...ems3_ch39.html (20x speedup, 8800 GTX vs single core of Core 2 Duo Extreme),
    - http://nvlabs.github.io/cub/ (prefix scan at tens of billions of elements per second),
    - http://nvlabs.github.io/moderngpu/scan.html (prefix scan with a throughput of hundreds of gigabytes per second).
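    For the length-to-offset scan itself, the CUB call boils down to a couple of lines (a sketch with my own variable names, assuming the per-symbol code lengths are already in device memory):

    Code:
    // Minimal CUB DeviceScan sketch: exclusive sum of per-symbol code lengths.
    #include <cub/cub.cuh>
    #include <stdint.h>

    void scan_lengths(const uint64_t *d_lengths, uint64_t *d_offsets, int n)
    {
        void  *d_temp = nullptr;
        size_t temp_bytes = 0;
        // First call only queries how much temporary storage is needed.
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_lengths, d_offsets, n);
        cudaMalloc(&d_temp, temp_bytes);
        // Second call performs the actual scan.
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_lengths, d_offsets, n);
        cudaFree(d_temp);
    }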

  16. #16
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Piotr: maybe Huffman can be sped up on a GPU. You said it couldn't. I was just speculating on why.

  17. #17
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Did I?
    Quote Originally Posted by me
    I wonder why they say Huffman coding isn't well parallelizable. First you do a prefix sum on the code lengths, and then you have both offsets and codes, so the problem becomes embarrassingly parallel - at least for encoding.
    I think it's clear enough, unless my English is bad.

  18. #18
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I meant that compressing while sorting could make it faster on a regular CPU. You don't have to resort to parallel processing unless you're out of ideas for major improvements serially.
    Then you have me confused. I do not know how you can compress while sorting and maintain a 1:1 input to output size ratio.

    Personally, I am more interested in using multiple processors serially and on locally parallelizable processes. For instance, one processor to get the code lengths and another to generate the codes from those lengths, or multiple processors to update context maps. I think you could do some pretty amazing things with FPGAs.

    Quote Originally Posted by nburns View Post
    Does it?
    In my experience, yes. YMMV.
    Last edited by Kennon Conrad; 25th July 2014 at 19:49.

  19. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Did I?

    I think it's clear enough, unless my English is bad.
    I was speculating on why "they" say Huffman coding isn't parallelizable.

  20. #20
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Then you have me confused. I do not know how you can compress while sorting and maintain a 1:1 input to output size ratio.
    They're two different scenarios. 1:1 is if you are using a GPU. Compressing while sorting is if you are using a regular single-threaded CPU.

  21. #21
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    They're two different scenarios. 1:1 is if you are using a GPU. Compressing while sorting is if you are using a regular single-threaded CPU.
    Ok. That's not nearly as interesting to me as a multi-threaded sort and compress.

