
Thread: CM on GPU - I need ideas for parallelization

  1. #1
    Member
    Join Date
    Aug 2013
    Location
    Germany
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    CM on GPU - I need ideas for parallelization

    I want to try creating a CM compressor and decompressor running on GPUs via Nvidia CUDA, so I need some ideas for how to realize the required parallelism. (Each instruction needs to be executed by >1000 threads in parallel, and with no branching, because the instruction scheduler in each group of threads can only execute one common instruction at a time.)
    So just running the code we currently have (for example ZPAQ: one block for each of the ~1000 threads) on the GPU doesn't work.
    I cannot guarantee that I will complete the project, but I will try.
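    To illustrate the constraint, here is a minimal CUDA sketch (the kernel name and the simple counter model are hypothetical, not taken from any existing compressor) of a branch-free bit-model update - the kind of inner loop that maps well onto SIMT hardware, because every thread executes the same instructions and the observed bit enters as data rather than as a branch:

        __global__ void update_models(unsigned short *p,         // 12-bit P(bit==1) per context, 0..4095
                                      const unsigned char *bits, // observed bit per context
                                      int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;                  // divergent only in the last partial warp

            int pr     = p[i];
            int target = bits[i] * 4095;         // target = bit ? 4095 : 0, computed arithmetically
            pr += (target - pr) >> 5;            // move 1/32 of the way toward the target, no if/else
            p[i] = (unsigned short)pr;
        }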

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I have thought about that but I haven't come up with anything reasonable.

    However, I think GPGPUs could still be very useful in developing an efficient CM. Very often you need to tune the constants in an algorithm to obtain optimal performance on various kinds of data. The constants could be tuned using, e.g., a genetic algorithm or brute-force search. Usually, when tuning, the control flow remains completely the same - the same branches are taken on every optimization pass - so it should be suitable for running on GPGPUs.
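    For instance, a brute-force sweep could give every thread its own candidate constant while replaying the same data, so all threads follow identical control flow. A rough sketch (names are hypothetical; it assumes a paq-style 12-bit bit model whose adaptation shift is the constant being tuned):

        __global__ void score_candidates(const unsigned char *bits, int nbits,
                                         const int *shifts,  // one candidate shift per thread
                                         float *cost,        // total coding cost per candidate
                                         int ncand)
        {
            int t = blockIdx.x * blockDim.x + threadIdx.x;
            if (t >= ncand) return;
            int   shift = shifts[t];
            int   pr    = 2048;                   // P(bit==1) in 0..4095, initially 1/2
            float c     = 0.0f;
            for (int i = 0; i < nbits; ++i) {     // same trip count and branches for every thread
                int   bit = bits[i];
                float p1  = pr * (1.0f / 4096.0f);
                float p   = bit ? p1 : 1.0f - p1; // compiles to a select, not a divergent branch
                c -= __log2f(fmaxf(p, 1e-6f));    // ideal code length of this bit
                pr += (bit * 4095 - pr) >> shift;
            }
            cost[t] = c;                          // pick the argmin on the host
        }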

    There's an additional catch here, though - compression algorithms are bottlenecked to some extent by memory latency, which isn't really any better on GPGPUs than the latency of ordinary DDR3 system memory (GDDR5 was designed for high bandwidth at the expense of latency - but since GPGPUs are highly parallel, the latency can be masked by rapidly switching between many sets of threads). To reduce the number of random memory accesses you could precompute the values that don't change while you tune the constants. So, e.g., you could serialize the inputs from the models and use that when optimizing.
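    A host-side sketch of that serialization step (the Model interface here is made up for illustration): run the models once, in order, and dump their per-bit predictions into a flat buffer; the tuning kernel then streams this buffer sequentially with coalesced reads, instead of repeating the models' random context-table lookups on every pass:

        #include <cstdint>
        #include <vector>

        struct Model {
            // Returns P(next bit==1) in 0..4095, then updates itself with the actual bit.
            virtual uint16_t predict_and_update(int bit) = 0;
        };

        std::vector<uint16_t> serialize_inputs(const std::vector<uint8_t> &bits,
                                               std::vector<Model *> &models)
        {
            std::vector<uint16_t> out;
            out.reserve(bits.size() * models.size());
            for (uint8_t bit : bits)
                for (Model *m : models)           // layout: models.size() predictions per bit
                    out.push_back(m->predict_and_update(bit));
            return out;                           // upload once (cudaMemcpy) and reuse every pass
        }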

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    btw, the bsc compressor runs on the gpu using cuda, and i've found that it produces erroneous output rather frequently. that's ok for games, but not for data compression

  4. #4
    Member
    Join Date
    Aug 2013
    Location
    Germany
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    btw, the bsc compressor runs on the gpu using cuda, and i've found that it produces erroneous output rather frequently. that's ok for games, but not for data compression
    Some GPUs are overclocked (OK, most GPUs are overclocked), and overclocking reduces stability. With performance that high, redundant computation becomes affordable. Imagine: in the past you had to wait 10 hours to compress your photos (RAW format, converted to uncompressed DNG to allow good compression). Now you can use the GPU and compress them in only half an hour, but with a risk of data errors. A redundant implementation for your GPU takes 2 hours, with the same error risk as your CPU.

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    yes, my gpu was factory-overclocked. so take this into consideration - the rest is your work

  6. #6
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Whether you get errors or not depends on the class of GPU. Typical consumer GPUs of the GT*** series are optimized for speed at the cost of correctness - in a game, a few wrong pixels here and there do not matter.
    On the other hand, the Quadro class of GPUs (FireGL for ATI) is designed for correctness, as they are targeted at the CAD market. The Tesla and Kepler GPUs from Nvidia take this one step further: they are targeted at data crunching rather than graphics, so correctness is paramount there, including ECC GDDR5 RAM. I have been working with three generations of Nvidia GPUs, starting with the Tesla S2050, then the M2090, and now the Kepler K20m. The workloads involve large amounts of data crunching, both floating point and integer. No stray errors to date.

