
Thread: Mass parallel (GPGPU) compression algos

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    Mass parallel (GPGPU) compression algos

    Let's discuss what kinds of lossless compression algos (or their parts) can be made faster employing mass-parallel computers and in particular modern GPUs

    I don't have good knowledge of modern GPUs (still have to read the book), but afaik they have dozens of simultaneously executing threads (like CPU cores), with dozens of numbers processed by each thread simultaneously in SIMD fashion. also, they support dozens of ready-to-run threads per execution thread, like the HyperThreading cpu technology

    taking into account execution delays and especially memory access delays (partially hidden by 16-48 kb of cache), a GPU may be treated as a cpu executing hundreds of threads simultaneously, with a wide SIMD device in each thread. so efficient GPU execution needs algorithms with the following properties:

    1) the operation may be split into hundreds of independent tasks, or alternatively into dozens of tasks accessing only 16-48 kb of data each
    2) data can be processed in SIMD fashion with dozens of parallel items

    computations meeting both criteria are ideal for GPUs. but probably any computation that may be split into 100+ independent jobs will be faster on a GPU, since a cpu can't provide that level of parallelism. and vice versa - a computation that cannot be split into more than 16 jobs probably cannot benefit from GPU execution, since modern cpus can execute 8 threads simultaneously, have pretty wide SIMD devices and pretty much the same cache hierarchy as GPUs. well, cpu memory/cache is faster while gpu memory/cache is wider, so bandwidth-hungry algorithms will be ideal for the GPU, too

    so, basically we just need to select algos that may be split into 100+ independent jobs. i will name a few:

    1. static bit-precise (huffman) encoding, followed by an m/t shift of each part into its destination position
    2. any other static-model encoding such as PPM, with each next block starting at the next bit/byte boundary. decompression can be made m/t too
    3. static-dictionary LZ/dictionary replacement
    4. CM encoding with multiple models running simultaneously

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    A typical GPU doesn't have hundreds of compute units. High-end ones have tens of them, while low-end ones have just a few. Each compute unit is a wide SIMD with fast task switching to work around memory latencies. (I'm abusing terminology somewhat, but I want to point out that GPGPUs can't execute hundreds of different instructions at once.)

    If you parallelize only enough to keep each compute unit busy, but not enough to utilize the full SIMD width (I mean - you don't use the vector lanes, only scalar work), then you will probably see a slowdown compared to a multicore CPU.

    Another disadvantage of GPUs is that the ratio of threads to total cache size is very bad compared to CPUs. Generally on GPUs it works out to maybe a few kilobytes of cache space per thread, whereas on CPUs you often have over a megabyte. If you run out of cache space then you'll access global memory, which reduces efficiency greatly. Even the last-level CPU caches have plenty of bandwidth, and with CPUs you don't have to deal with the overhead of adapting an algorithm to a parallel environment, so you can end up with mediocre improvements if you use global VRAM too much.




    I think one can do a sensible LZ-like encoding algo on a GPGPU. First the format description: it will be LZSS-like, but each token covers 4 bytes. A flag bit 0 will mean a verbatim 4-byte chunk, while bit 1 will mean a 16-bit offset to some previous repetition.
    Then an algo:
    - start with a stable radix sort to do a block sort on the first 4 symbols of each suffix. Store tuples (prefix, position) (the prefix is the sorting key),
    - now, for each position divisible by 4, look at the previous tuple - if the prefix is the same and the position is within 2^16 bytes then we have a match, and we store it to an auxiliary array (the flag is also stored in another auxiliary array),
    - if the prefix is different or the position is too distant, we only store the flag (the offset is irrelevant so we can skip setting it),
    - after that we need to do a prefix sum on the flags to determine the final positions of the offsets, and at that point compression is almost done.

    The algo is fully parallel and can use big blocks, as radix sorts work efficiently on big blocks even on GPGPUs.
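    A rough CUDA/Thrust sketch of the steps above. The names (build_keys, find_matches, lzss_gpu_sketch) are mine, it assumes the block size is a multiple of 4, and the final scatter of flags/offsets into the output stream is only indicated in a comment - so treat it as an illustration of the idea, not a finished encoder:

    [CODE]
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/scan.h>
    #include <cstdint>

    // One (key, position) tuple per suffix: key = first 4 bytes of the suffix.
    __global__ void build_keys(const uint8_t* data, int n_suffixes,
                               uint32_t* keys, uint32_t* pos)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_suffixes) return;
        pos[i]  = i;
        keys[i] = (uint32_t(data[i]) << 24) | (uint32_t(data[i + 1]) << 16) |
                  (uint32_t(data[i + 2]) << 8) | uint32_t(data[i + 3]);
    }

    // After a stable sort by key, the previous tuple with an equal key is the
    // nearest earlier occurrence of the same 4-byte prefix.
    __global__ void find_matches(const uint32_t* keys, const uint32_t* pos,
                                 int n_suffixes, uint32_t* flag, uint16_t* offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_suffixes) return;
        uint32_t p = pos[i];
        if (p % 4 != 0) return;                    // one token per 4-byte chunk
        int tok = p / 4;
        flag[tok] = 0;
        if (i > 0 && keys[i] == keys[i - 1]) {
            uint32_t dist = p - pos[i - 1];
            if (dist < 65536) {                    // within the 16-bit offset range
                flag[tok]   = 1;
                offset[tok] = uint16_t(dist);
            }
        }
    }

    void lzss_gpu_sketch(const thrust::device_vector<uint8_t>& data)
    {
        int n_suffixes = int(data.size()) - 3;     // suffixes with a full 4-byte prefix
        int n_tokens   = int(data.size()) / 4;     // assumes size % 4 == 0
        thrust::device_vector<uint32_t> keys(n_suffixes), pos(n_suffixes);
        thrust::device_vector<uint32_t> flag(n_tokens), slot(n_tokens);
        thrust::device_vector<uint16_t> offset(n_tokens);

        int threads = 256, blocks = (n_suffixes + threads - 1) / threads;
        build_keys<<<blocks, threads>>>(thrust::raw_pointer_cast(data.data()),
                                        n_suffixes,
                                        thrust::raw_pointer_cast(keys.data()),
                                        thrust::raw_pointer_cast(pos.data()));

        // Stable sort by prefix (Thrust typically uses a radix sort for 32-bit
        // keys); positions stay ascending inside each group of equal prefixes.
        thrust::stable_sort_by_key(keys.begin(), keys.end(), pos.begin());

        find_matches<<<blocks, threads>>>(thrust::raw_pointer_cast(keys.data()),
                                          thrust::raw_pointer_cast(pos.data()),
                                          n_suffixes,
                                          thrust::raw_pointer_cast(flag.data()),
                                          thrust::raw_pointer_cast(offset.data()));

        // Prefix sum over the flags gives each match its slot in the offset stream.
        thrust::exclusive_scan(flag.begin(), flag.end(), slot.begin());
        // Remaining step (not shown): scatter the flags into a bitmap and the
        // matched offsets into their slots to produce the final compressed block.
    }
    [/CODE]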

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    a bit more on cpus and gpus. for the last 5 years intel's $300-class cpu has stayed pretty much the same - 4c/8t (4 cores, 8 threads). but that is driven by marketing, not technology: intel sells 15c/30t @ 2.5 GHz cpus, which is a pretty natural increase taking into account the 45nm -> 22nm transition

    ibm power8 is 12c/96t @ 3.5 GHz and sparc t5 is 16c/128t @ 3.6 GHz. intel xeon phi is 60c/240t @ 1 GHz. just for reference

    the first ATI GPU with unified shaders, the 2900XT released in may 2007, had 320 ALUs running at 740 MHz. the latest one, the 290x (oct 2013), has 2816 ALUs @ 1 GHz, i.e. roughly a 10x increase in 7 years

  4. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I have a Sandy Bridge CPU. Can you do any computation on the integrated graphics?

  5. #5
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    This is way outside of my area of knowledge, but I found this entry from Wikipedia to be interesting:

     
    GPU-HMMER | Parallelized local and global search with profile Hidden Markov Models | 60–100x speed-up | Tesla 2075, 2090, K10, K20, K20X | Multi-GPU: Yes | Available now, Version 2.3.2
    It appears they are claiming a 60-100x speed improvement over a multi-core x86.

    Edit: That didn't come through well. The original page is here: http://en.wikipedia.org/wiki/GPGPU#C...onal_resources.

  6. #6
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    437
    Thanks
    1
    Thanked 96 Times in 57 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Let's discuss what kinds of lossless compression algos (or their parts) can be made faster employing mass-parallel computers and in particular modern GPUs I don't have good knowledge of modern GPUs (still have to read the book), but afaik they have dozens of simultaneously executed threads (like CPU cores) with dozens of numbers processed by each thread simultanelously in SIMD fashion. also, they support dozens of ready-to-run thread per each execution thread, like the HyperThreading cpu technology
    Unfortunately, that's quite hard. A couple of years ago compression algorithms for parallel architectures were discussed in the scientific community, though most of them rely on the type of parallelism CPUs have to offer. For GPUs, the story is quite a different one. Even though GPUs have many parallel compute units, the type of parallelism they offer is quite different from CPUs. The compute units are the scalar units of a vector processor, and each GPU has only a small number of vector processors (four or eight), similar to the small number of CPU cores you have on a modern multi-core processor. The scalar units of a GPU vector processor all operate in parallel, executing the same operations - which means specifically that you cannot "branch" within a single vector unit. Or actually, you can, but it is emulated by executing all options of a conditional expression and throwing the unwanted results away. Algorithms on vector processors should hence avoid branches to be as fast as possible - not exactly an ideal condition for data compression.
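    A minimal CUDA illustration of that point (my own toy example, not from the post): when the lanes of one warp disagree on a condition, the warp steps through both paths with the inactive lanes masked, so the cost is roughly the sum of both branches.

    [CODE]
    __global__ void divergent_branch(const int* in, int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] & 1)                 // lanes within a warp may disagree here,
            out[i] = 3 * in[i] + 1;    //   so the warp executes this path (odd lanes active)
        else
            out[i] = in[i] >> 1;       //   and then this path (even lanes active)
    }
    [/CODE]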
    Quote Originally Posted by Bulat Ziganshin View Post
    1) operation may be split into hundreds of independent tasks, or alternatively - into dozens tasks accessing only 16-48 kb of data
    Typically called "slicing". Yes. Unfortunately, it is hard to exchange data between the threads since they run simultaneously (or are emulated to run simultaneously) - that's quite the point. The problem with that, from a compression point of view, is that you lose the ability to exploit data dependencies between the slices, or more scientifically, your compression performance loses the co-information between the slices. This loss can be computed precisely - it is, as said, the co-information. (Not hard to do that.)
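    For concreteness (a standard identity, stated here as an illustration rather than anything new from the post): for two slices X and Y, coding them independently instead of jointly costs H(X) + H(Y) - H(X,Y) = I(X;Y) >= 0 bits per block, i.e. exactly their mutual information; for more slices, the loss is the sum of the per-slice entropies minus their joint entropy.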
    Quote Originally Posted by Bulat Ziganshin View Post
    2) data can be processed in SIMD fashion with dozens of parallel items
    Which imposes another limitation, namely that of arranging the input or output data. For lossless compression, either input or output must be of varying length - as you want to represent the more likely symbols with shorter bitstrings. (There are more options for lossy, like fixed-length quantization.) For fixed-to-variable length coding, e.g. Huffman, you need a unit that fits the varying-length outputs together, and hence operates serially. For variable-to-fixed length coding, e.g. Tunstall, you need to split the input into unequal segment sizes without actually compressing the input, hence you can only operate approximately and will necessarily compress some symbols twice, i.e. you lose compression performance, at least. It is then still unclear how to arrange the input such that the output fits together.
    Quote Originally Posted by Bulat Ziganshin View Post
    1. static bit-precise (huffman) encoding, followed by the m/t shift into destination positions
    See option 1, fixed-to-variable length coding. Option 2 also exists. The problem is the serial operation to merge the output of the units. In my experiments (that was lossy and lossless image coding) that part ruined the performance of the parallel part. Actually, the whole need to move the data onto the GPU and off the GPU ruined the performance. And it is even a bit more complicated: you usually do not want to use a fixed slicing for all parts of a full algorithm (e.g. image compression consists of transformation, quantization and entropy coding), but for optimal performance you should vary the size of the slices depending on the algorithm you are performing. This, however, also comes at the cost of having to buffer data. There is no synchronization primitive between threads of different work groups in OpenCL or CUDA, simply because you do not know whether the threads are running in parallel (on different vector processors) or serially (on the same vector processor). That's up to the graphics card driver to decide.
    Quote Originally Posted by Bulat Ziganshin View Post
    2. any other static-model encoding such as PPM, with each next block started from the next bit/byte. decompression can be made m/t too 3. static-dictionary LZ/dictionary replacement
    I haven't looked into these, basically because they are not relevant for images. The problem with LZ variants is again that they are intrinsically serial, as LZ learns its own dictionary, i.e. the algorithm has a dependency that reaches back in time. That's pretty bad for parallel algorithms, and you somehow need to break it up, for example by "coloring" the input (assigning a label to each letter) and only touching the letters of one color in each thread. Again, this comes at a price in compression performance.
    Quote Originally Posted by Bulat Ziganshin View Post
    4. CM encoding with multiple models running simultaneously
    That's probably your best bet, because you can try so many things in parallel. Each individual algorithm won't be fast (the scalar units of GPUs are *not* fast - you simply have many of them, which is what gives the GPU its performance), but since you can try so many at once, you can probably get very good compression from tons of serial algorithms tried at the same time, with the output of most of them thrown away. It sounds wasteful, but it's a common strategy for GPU programming. Unfortunately, it wasn't quite an option for classical image compression. Something like fractal compression might work well on GPUs, since you have to try very often to find a "match" in the image. Fractal compression, however, has quite a lot of drawbacks, such as having no reasonable "quality control" via quantization.

  7. The Following User Says Thank You to thorfdbg For This Useful Post:

    Bulat Ziganshin (26th July 2014)

  8. #7
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    There are also cases where you have many independent compression jobs that you can vectorise. Example: ZFS writing 1 GB/s compresses no less than 8000 independent blocks per second. They get written to HDD in transactions every 10 seconds, so you have plenty of work that you can queue freely and process in batches as big as you wish.
    Last edited by m^2; 26th July 2014 at 13:35. Reason: Corrected arithmetic

  9. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    thorfdbg, thanks for the thorough analysis. again, i haven't yet learned GPU intrinsics, but it seems that they are constructed from SIMD-like devices, so there is some parity with cpus: algos employing SIMD will be much faster both on cpu and GPU, and non-SIMD ones will be slow on both devices. the main difference is that a GPU can run hundreds of threads simultaneously, hiding memory latencies. also, support for hundreds of threads is starting to appear in cpus too. i really wonder why Intel doesn't sell a SIMD-less version of Xeon Phi as a rival to the Oracle T series

    i considered only algos that don't need to exchange information between threads. in particular, huffman encoding can be done by splitting the input data into 1 KB blocks, simultaneously encoding them, then calculating the offset of each encoded block in the combined output in O(log N) (a parallel prefix sum - see the sketch below), and then simultaneously moving each encoded block into its final position
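    A sketch of that offset step in Thrust (the function name and the unsigned long long bit counts are my own choices, just to illustrate the O(log N)-depth scan over per-block encoded sizes):

    [CODE]
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Given the encoded bit length of every 1 KB block, an exclusive prefix sum
    // yields each block's destination bit offset in the combined output; a second
    // parallel pass can then shift/copy every block to its final position.
    thrust::device_vector<unsigned long long>
    destination_bit_offsets(const thrust::device_vector<unsigned long long>& bit_len)
    {
        thrust::device_vector<unsigned long long> off(bit_len.size());
        thrust::exclusive_scan(bit_len.begin(), bit_len.end(), off.begin());
        return off;
    }
    [/CODE]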

    of course, that leaves the problem of parallel decoding - it would require saving the position of each subblock, and since that's quite a lot of extra info (2-4 bytes each), we can simplify the task by starting each compressed block on a byte boundary. the same approach may be used to store ari-encoded data


    m^2: so we need algos that either don't learn (static models) or learn so fast that they can be used on small datasets (although 2 GB / 256 threads = 8 MB/thread)
    Last edited by Bulat Ziganshin; 26th July 2014 at 13:19.

  10. #9
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Static models won't work in something as generic as a filesystem. But fast learning? You have 128 KB of data in each chunk, that's plenty.

  11. #10
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    the main difference is that GPU can run hundreds of threads simultaneously, hiding memory latencies. also, support for hundreds of threads starting to appear in cpus too. i really wonder why Intel doesn't sell SIMD-less version of Xeon Phi as rival to Oracle T series
    CPUs hide memory latencies by out-of-order execution and speculative execution (yes, speculative execution can hide memory latencies, since you often take jumps that depend on not-yet-cached memory contents). OTOH, running lots of threads means the cache is effectively divided among them, reducing the cache hit ratio.
    It's all a tradeoff, but Intel's CPUs are designed for high single-thread speed, while many other CPUs are designed for high multi-thread throughput.

  12. #11
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    437
    Thanks
    1
    Thanked 96 Times in 57 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    thorfdbg, thanks for thorough analysis. again, i'm not yet learned GPU intrinsincs. but it seems that they are constructed from SIMD-like devices, so here is some parity with cpus: algos employing SIMD will be much faster both on cpu and GPU, and non-SIMD will be slow on both devices. the main difference is that GPU can run hundreds of threads simultaneously, hiding memory latencies. also, support for hundreds of threads starting to appear in cpus too. i really wonder why Intel doesn't sell SIMD-less version of Xeon Phi as rival to Oracle T series
    Well, it's more complicated than that. On a CPU, you have SIMD instructions, usually taking input vectors of dimension four, and you have several CPU cores, typically four to sixteen nowadays. On a GPU, you have a triple scaling. First of all, you have SIMD instructions by themselves. They also take input vectors of dimension four, originally designed for typical vertex or fragment shader computations. Then you have multiple cores, similar to a CPU, also typically four to eight. However, you have a level in between: the vectors a typical vector core processes are much larger than the four elements you see from the SIMD instructions (actually, you never see the instructions - you only have the source code in OpenCL or CUDA that is compiled to graphics-card-specific code at run time). In essence, a single vector processor computes on vectors that are several hundred elements long, and all the constructions you see in OpenCL or CUDA are merely "emulations" to fit into this vector structure, i.e. a single "if" clause will always execute both branches on *all* units, and then decide, based on the condition, which results to use or discard. Or to put it differently: all units of a single vector core always execute the same instruction in a single step, in parallel, across all the units it has. You cannot tell a vector processor that one of its units should execute a different program than another unit. Thus, the more nested the program gets, the slower it becomes, basically because all units have to go through all branches, and the sum of the sizes of all branches defines the number of instructions the program has to step through. Similarly with loops: if the loop condition depends on per-unit data, the longest possible loop defines how much time is spent in the loop. Loop iterations not required by some units will still be computed, but their output will be thrown away.
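    A toy kernel (my own illustration, not from the post) for the loop remark: the warp keeps iterating until its longest-running lane is done, and lanes that finished early simply idle.

    [CODE]
    __global__ void per_lane_loop(const int* trip_count, int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int acc = 0;
        for (int k = 0; k < trip_count[i]; ++k)   // warp time ~ max trip_count within the warp
            acc += k;
        out[i] = acc;
    }
    [/CODE]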
    Quote Originally Posted by Bulat Ziganshin View Post
    i considered only algos that doesn't need to exhange information between threads. in particular, huffman encoding can be done by splitting input data into 1K byte blocks, simultaneously encoding them, then calculating offset for each encoded block in the combined output in O(log N), and then simultaneously moving each encoded block into its final position
    That's pretty much what I also tried, though the problem with that approach is that you need to bit-shift the outputs to their final positions, i.e. you have a serial post-processing step that limits the execution speed. You can try to offload that onto the GPU as well if you want, but in either case it limits performance. The worst speed brake, however, was in my case simply the PCIe bus. It's not as "express" as one may think. For an average wavelet lifting, the time taken by the GPU to perform the lifting was approximately equal to the time required to get the data onto the GPU, and in that time the CPU could have computed the lifting itself. All this GPU processing only makes sense if you can somehow hide the latency of the memory copy operations behind other operations, i.e. if you compress several images in parallel, say from a motion sequence. In that case, part of the code can be busy moving data onto the GPU while the GPU is still busy with the last frame, and another CPU thread is busy getting data back from the GPU for the frame before that. In such a construction one can gain speedups. With just a single image to compress, I did not find much of an advantage. Linux was a bit faster on the GPU, Windows a bit slower, the latter likely because of the number of legacy software layers the code has to go through just to get access to the GPU.
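    A minimal sketch of that overlap using CUDA streams (compress_frame is a placeholder stub of mine, not the poster's code, and the host buffers are assumed to be pinned via cudaMallocHost so that cudaMemcpyAsync can actually overlap with kernels):

    [CODE]
    #include <cuda_runtime.h>

    // Placeholder kernel (assumption): stands in for a real per-frame compressor.
    __global__ void compress_frame(const unsigned char* in, unsigned char* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];   // identity "compression" just to keep the sketch self-contained
    }

    // Round-robin over a few streams: while the GPU works on frame i, frame i+1
    // is being uploaded and frame i-1 downloaded.
    void pipeline(unsigned char* const* host_in, unsigned char* const* host_out,
                  int frames, int frame_bytes)
    {
        const int S = 3;
        cudaStream_t stream[S];
        unsigned char *d_in[S], *d_out[S];
        for (int i = 0; i < S; ++i) {
            cudaStreamCreate(&stream[i]);
            cudaMalloc((void**)&d_in[i],  frame_bytes);
            cudaMalloc((void**)&d_out[i], frame_bytes);
        }
        for (int f = 0; f < frames; ++f) {
            int i = f % S;
            cudaStreamSynchronize(stream[i]);         // wait until this slot's previous frame is done
            cudaMemcpyAsync(d_in[i], host_in[f], frame_bytes,
                            cudaMemcpyHostToDevice, stream[i]);
            compress_frame<<<(frame_bytes + 255) / 256, 256, 0, stream[i]>>>(
                d_in[i], d_out[i], frame_bytes);
            cudaMemcpyAsync(host_out[f], d_out[i], frame_bytes,
                            cudaMemcpyDeviceToHost, stream[i]);
        }
        for (int i = 0; i < S; ++i) {
            cudaStreamSynchronize(stream[i]);
            cudaFree(d_in[i]); cudaFree(d_out[i]);
            cudaStreamDestroy(stream[i]);
        }
    }
    [/CODE]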

  13. The Following User Says Thank You to thorfdbg For This Useful Post:

    Paul W. (27th July 2014)

  14. #12
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Not sure if this (old) paper has been noticed:
    http://repository.tamu.edu/bitstream...pdf?sequence=2

    It talks about parallel match finder and range coding for LZMA on GPUs.

  15. The Following User Says Thank You to moinakg For This Useful Post:

    Bulat Ziganshin (13th May 2015)

  16. #13
    Member just a worm's Avatar
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts
    the main difference is that GPU can run hundreds of threads simultaneously, hiding memory latencies
    From my understanding a GPU still has only a few cores. A newer Intel Atom, for example, has between 1 and 8 CPU cores plus 2 GPGPU cores. If you start more than 2 threads on the GPU then only 2 are being executed in parallel. But the GPU can switch like the CPU does with hyper-threading: if an operation stalls because of a memory access which can't be served immediately, it switches to a different thread.

  17. #14
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    From my understanding, in the case of the Radeon HD 7970 there are 32 compute units, each compute unit has 4 SIMDs with independent program counters, and each SIMD has 16 ALUs that execute the same instruction. So 32 CUs * 4 SIMDs/CU = 128 SIMDs, and the Radeon HD 7970 can be viewed as a 128-core processor with 16 * 32 = 512-bit SIMDs. There are 10 program counters per SIMD (the SIMD chooses the next instruction from the 10 pointed to by those program counters), so overall there are 1280 different program counters present at the same time. But there's a restriction - a SIMD not only executes a single instruction in all 16 ALUs, but each instruction is also always repeated 4 times, and that's why the granularity of branching is 4 * 16 = 64 work-items (this is called a wavefront). If you multiply 1280 program counters by 64 work-items per program counter, you get a maximum of 81920 threads in flight on a Radeon HD 7970. From my understanding, switching between threads is essentially free, provided the threads are already created, the instructions are already cached and the threads are queued in the SIMDs.

    A single memory access on a GPU has ridiculously long latency. With a GPU clocked at 1 GHz, a VRAM access can take hundreds of clocks. Of course access to on-chip caches is substantially quicker. In order to mask VRAM latencies you have to run plenty of threads at the same time.

    One of the biggest advantages of GCN SIMD over x86 SIMD is that the GPU has optimized scatter and gather instructions (each work-item can address memory independently, both for the Local Data Share and for (possibly cached) VRAM), whereas x86 CPUs don't have scatter at all and gather is unoptimized and present only in the newest iterations of AVX.
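    For illustration (a trivial kernel of my own, not from the post): per-lane gather is just independent addressing within one instruction stream, e.g. a table-lookup transform where every work-item indexes the LUT on its own.

    [CODE]
    __global__ void gather_lut(const unsigned char* in, const unsigned short* lut,
                               unsigned short* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = lut[in[i]];   // each work-item computes its own address
    }
    [/CODE]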


    But the GPU can switch like the CPU does with hyper threading. If an operation stalls because of a memory access which can't be served immediately then it switches to a different thread.
    From my understanding, HyperThreading, at least in Haswell, doesn't work like that. SPARC Niagara works like that, but Haswell works differently. Each Haswell core has two frontends that issue instructions to shared pipelines, and the thread that issues a micro-operation to a pipeline (or port?) first wins. Blocking can happen due to any bottleneck, not only stalls on memory, e.g. if two threads want to issue a total of 5 additions at once but the core has pipelines (ports?) for only 4 additions, then one thread stalls. You can't control which one stalls, even by changing privilege levels or other operating-system-level properties.
    Last edited by Piotr Tarsa; 19th November 2014 at 14:50.

  18. The Following 3 Users Say Thank You to Piotr Tarsa For This Useful Post:

    Bulat Ziganshin (19th November 2014),just a worm (20th November 2014),Paul W. (19th November 2014)

  19. #15
    Member just a worm's Avatar
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts
    .
    Last edited by just a worm; 20th November 2014 at 14:49.

