
Thread: GPU compression again

  1. #1 Shelwien (Administrator)

    GPU compression again

    Code:
    <schnaader>
      Got two machines here. First is 800 MHz AMD, 256 MB RAM :(
      second one is 2.2 GHz Celeron, 1 GB RAM. I plan to
      upgrade to a 3 GHz single-core machine soon and will
      invest in a nice quad-core machine with enough RAM and a
      nice GPU card some time after that to get rid of
      "my-machine-sucks"-problems :)
      and of course, to be able to get some experience in
      multithreading and work on GPU. Precomp would get a nice
      speed boost running on 4 cores and stopping temporary file
      usage
    <toffer> 
      well multithreading is nice indeed
      i cut down the time eaten up by my genetic optimizer by a
      factor of 3, using 4 threads
    <schnaader> 
      Yes, it's useful at least if you can parallelize the
      algorithm - I guess precomp could get almost linear
      speed-up, as you'll get thousands of independent
      (de)compression tasks
      if there'd be some deflate implementation on GPUs, you
      could even get it to lightspeed, but I guess there's too
      much I/O involved for GPUs...
    <toffer> 
      lz on a gpu
      weird stuff
    <schnaader> 
      weird and useless... doesn't matter if you get 1 or 10
      GB/s if your disk speed is only 100 MB/s :)
    <toffer>
      yep
      i didn't take that into account
    <Shelwien> 
      as to deflate on GPU
      it should be easy at least with Cuda
      you just compile the plain C/C++ code for GPU
      and dispatch it to run on GPU from the main unit
    <schnaader> 
      it should be easy that way, yes, but would it get faster, too?
    <Shelwien> 
      now, that's really unlikely ;)
    <schnaader> 
      most compression discussions involving GPUs that I saw
      said something like "There's too much I/O involved, this
      won't get you anywhere"...
    <Shelwien> 
      GPU is not much faster than core2quad even on dumb
      parallel tasks like password cracking
    <schnaader> 
      Huh? IIRC, especially on password cracking or things
      like computing md5 hashes it's much faster, like 10x
      speedup
    <Shelwien> 
      http://3.14.by/en/read/md5_benchmark
      no, alas
    <schnaader> 
      Elcomsoft e.g. does password cracking on GPUs very successfully
    <Shelwien> 
      this is much faster than elcomsoft implementation
      you can see elcomsoft impl there, in fact
    <schnaader>
      OK... nice one
    <Shelwien> 
      anyway, its clock is lower than x86,
      something around 1 GHz afair,
      and memory access takes ~100 clocks,
      and everything is memory, other than 8k of registers
      and then, what's even more important for LZ
      GPU really hates branches in the code
    <schnaader> 
      Ah, I see. That's why they are better for straightahead
      tasks like video compression
    <Shelwien> 
      it seems that it doesn't really process that many
      threads in parallel, it's more like automatic
      vectorization than lots of independent parallel cores.
      So if you'd try to process 100 blocks of arithmetic
      instructions on registers, it'd really run 100x faster,
      but if you'd make these blocks go out of sync with
      branches etc, so that the instruction pointers in all
      threads would be different,
      then it would be only 8x faster or something
      depending on how many real physical cores there are
    <schnaader> 
      of course, even having the same speed as your CPU would
      still be nice as you can use both CPU+GPU at the same
      time :)
    <Shelwien> 
      in other words, it's like multicore + hyperthreading,
      but with a few more cores and physical threads
      and btw there are virtual threads too
    
      As to "having the same speed as your CPU would still be nice" -
      yes, but that isn't usually worth the work. especially
      taking into account that for ATI cards you need to repeat
      that again
    <schnaader> 
      yes, the incompatibility is one of the main disadvantages atm
      let's hope things get better. CUDA already is a nice
      step towards better usability, hopefully OpenCL will
      solve some problems, too.
    <Shelwien> 
      http://nsa.unaligned.org/index.php
    <schnaader> 
      yeah, of course nothing beats FPGAs.
    <Shelwien> 
      that's not FPGA
      that's FPGA-controlled bunch of cheap GPUs
      with a plain FPGA, unfortunately, it's very hard to beat
      a core2, because cheap boards run at like 30 MHz, so even
      if you make it very parallel it won't really help.
      and the number of elements there is very limited too,
      so not that many threads are possible
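
    For illustration, a minimal CUDA sketch of the warp-divergence effect described above (a hypothetical micro-benchmark, not from the chat; kernel names and constants are invented). Threads of a warp that take different branches are executed serially, which is one reason branchy code such as LZ match/literal decisions maps poorly to GPUs:
    Code:
    #include <cuda_runtime.h>

    __global__ void uniform_branch(int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned x = i;
        // every thread in a warp follows the same path: full SIMT throughput
        for (int k = 0; k < 1000; ++k)
            x = x * 1103515245u + 12345u;
        out[i] = (int)x;
    }

    __global__ void divergent_branch(int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned x = i;
        // odd and even lanes take different paths, so each warp runs both
        // paths one after the other; more distinct paths, bigger slowdown
        for (int k = 0; k < 1000; ++k) {
            if (i & 1) x = x * 1103515245u + 12345u;
            else       x = x * 22695477u + 1u;
        }
        out[i] = (int)x;
    }

    int main() {
        const int n = 1 << 20;
        int *d_out;
        cudaMalloc(&d_out, n * sizeof(int));
        uniform_branch<<<n / 256, 256>>>(d_out, n);
        divergent_branch<<<n / 256, 256>>>(d_out, n);  // time both with cudaEvents to see the gap
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }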

  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    30 MHz?
    Or maybe 1066?
    It may not be a cheap one, and I've heard that the max speed depends on the algorithm, but somehow I don't believe the difference is so big.

    Anyway, it won't become mainstream in the foreseeable future.

    And I'd like to add that there is a GPU-based compressor around: FlaCuda. It compresses audio.

  3. #3 Shelwien (Administrator)
    > 30 mhz?
    > Or maybe 1066?
    > It may not be a cheap one and I've heard that max speed
    > depends on algorithm, but somehow I don't believe that the
    > difference is so big.

    $2000-$11000 without tax? Plus $1000-$3000 for design software?
    I really think that using existing hardware would give
    a better computing-power-per-$ ratio.

    > And I'd like to add that there is a GPU based compressor
    > around. flacuda. Compresses audio.

    Yeah, http://cuetools.net/doku.php/flacuda
    Am I right that they are comparing it to the single-threaded
    version? And what about CPU/GPU clocks? Default?
    Also, there were h264 CUDA implementations long before that.

    So, it seems that for brute-force algorithms
    (which applies to FLAC as well, afaik)
    a GPU can provide the same or slightly higher computing speed
    than an overclocked Core2Quad CPU.
    But for LZ algorithms it should be different, as their
    bottleneck is memory access speed even on x86.

    Also, dumb CUDA ports do not require any special skills.
    Add a __device__ tag to a C++ processing function,
    a couple of cudaMalloc/cudaMemcpy calls, a call to the
    function on the GPU - and you'd have a GPU app (see the
    sketch below).
    But making it do something useful - now, that's really
    difficult.
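
    For reference, a compileable sketch of that recipe (all names invented; the "processing" is a placeholder XOR, not a real compressor):
    Code:
    #include <cstring>
    #include <cuda_runtime.h>

    // the original CPU routine just gets a __device__ tag...
    __device__ int process_byte(int b) { return b ^ 0x5A; }

    // ...plus a __global__ wrapper so the host can launch it
    __global__ void process_kernel(const unsigned char *in, unsigned char *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = (unsigned char)process_byte(in[i]);
    }

    int main() {
        const int n = 4096;
        unsigned char host_in[4096], host_out[4096];
        memset(host_in, 0xAA, n);

        unsigned char *dev_in, *dev_out;
        cudaMalloc(&dev_in, n);                                  // a couple of cudaMalloc...
        cudaMalloc(&dev_out, n);
        cudaMemcpy(dev_in, host_in, n, cudaMemcpyHostToDevice);  // ...and cudaMemcpy calls
        process_kernel<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);  // the call on the GPU
        cudaMemcpy(host_out, dev_out, n, cudaMemcpyDeviceToHost);
        cudaFree(dev_in);
        cudaFree(dev_out);
        return 0;
    }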

  4. #4 m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    Also, there were h264 CUDA implementations long before that.
    Correct.
    Though I don't usually think about lossy compression. There's too much mess with quality control and comparisons; almost all claims are unverifiable.
    Quote Originally Posted by Shelwien View Post
    Am I right that they are comparing it to the single-threaded version? And what about CPU/GPU clocks? Default?
    Yes, they compare to a single-threaded version. At the time FLACuda launched, there was no multithreaded FLAC encoder.
    Recently fpFLAC came out. On a modern quad core, it's bottlenecked by the HDD even in the strongest mode.
    FLACuda offers more brute-force options and has better I/O, so it won the only comparison that I've seen, but that doesn't really tell which one computes better.

  5. #5 Shelwien (Administrator)

  6. #6 Matt Mahoney (Expert)
    Isn't LZO already as fast as disk I/O without GPU?

  7. #7 m^2 (Member)
    The first successful implementation of LZ on a GPU that I've heard about. Interesting... though it would be better to have some results. Anybody with ACM access?

    Quote Originally Posted by Matt Mahoney View Post
    Isn't LZO already as fast as disk I/O without GPU?
    1. Disk is not all there is. There are many memory-to-memory uses.
    2. Disks are dead, and SSDs are often faster than LZO.
    3. It doesn't matter in this application nowadays, but it may in the future: embedded systems often have weak CPUs.

  8. #8 Bulat Ziganshin (Programmer)
    LZO has many modes; some are very slow but compress better. Anyway, it's an interesting result, but they aren't the first: WinZip can use AMD GPUs for compression.

  9. #9 Matt Mahoney (Expert)
    Disks are not dead. Disk gives you 10 times more storage than SSD for the same cost. If you can compress 90% at disk I/O speed, then it's an even tradeoff. Otherwise SSD is just another level of cache.

  10. #10 Karhunen (Member)
    Yup. Fast and slow compression and memory schemes both have their place. Intrinsic may have commented on this, but doesn't ZFS support LZO?
    And disks are still very important in the microcontroller arena. I stand by my original post.

  11. #11 Member (Bangalore)
    A specialized compression algorithm for double-precision floating-point numbers, accelerated on GPU:
    http://cs.txstate.edu/~burtscher/research/GFC/
    Now if it were possible to detect sequences of embedded floating-point numbers by doing some kind of data analysis, then this could be applied transparently (see the sketch below).
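
    A rough host-side sketch of such a detector (my own guess at a heuristic, nothing from GFC; the threshold values and names are invented). It counts how many consecutive aligned 8-byte words have a "plausible" IEEE-754 double exponent; a long run suggests an embedded double array:
    Code:
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // a 64-bit word "looks like" a double if its biased exponent falls in a
    // mid range (magnitude roughly 1e-64 .. 1e+64) -- a crude heuristic only
    static bool plausible_double(uint64_t bits) {
        int exp = (int)((bits >> 52) & 0x7FF);
        return exp > 810 && exp < 1236;
    }

    // length (in doubles) of the plausible run starting at byte offset pos
    static size_t run_length(const uint8_t *buf, size_t size, size_t pos) {
        size_t n = 0;
        while (pos + 8 <= size) {
            uint64_t bits;
            memcpy(&bits, buf + pos, 8);
            if (!plausible_double(bits)) break;
            ++n;
            pos += 8;
        }
        return n;
    }

    int main() {
        double sample[64];
        for (int i = 0; i < 64; ++i) sample[i] = 1.5 * i + 0.25;
        size_t run = run_length((const uint8_t *)sample, sizeof(sample), 0);
        printf("plausible doubles in a row: %zu\n", run);  // 64 for this array
        return 0;
    }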

    A GPU-accelerated hashing library using OpenCL:
    http://hashcat.net/oclhashcat-plus/
    Speed comparison here: http://www.codinghorror.com/blog/2012/04/speed-hashing.html

  12. #12 m^2 (Member)
    Quote Originally Posted by Karhunen View Post
    Yup. Fast and slow compression and memory schemes both have their place. Intrinsic may have commented on this, but doesn't ZFS support LZO?
    No. ZLE and LZJB. There are out-of-tree patches with LZ4 (integration pending) and LZF. ADDED: And there's deflate.
    Quote Originally Posted by Matt Mahoney View Post
    Disks are not dead. Disk gives you 10 times more storage than SSD for the same cost.
    Uses where price/GB is the dominant factor are quickly disappearing.
    Home users increasingly move their data to 3rd parties in the so-called 'cloud'. The great majority of them don't hold much data anyway, which makes the cost difference small while the performance difference is huge.
    In the server market it's all about virtualisation, which puts much greater demand on random-access performance, and HDDs offer lower IOPS/$, especially when taking TCO into account (drives eat more power and fail more often).
    Disks are big in the backup market, similarly with archival. Their streaming performance is OK, so they work well in a subset of HPC applications, just as in video.
    You speak about another level of cache? In some way, yes; different uses have different performance needs. SSDs have been in the most demanding ones for decades. They have been shifting down the performance curve and have become mainstream already... and the shift continues.
    Zsolt Kerekes, whom I regard as a very knowledgeable person when it comes to SSDs, predicts they'll go all the way down to backup and is confident enough to create a category for them on his site.
    So back to my original statement that disks are dead:
    They are not useless and will not be for as long as I can predict, yet their area of use shrinks fast and with increasing pace, and I expect them to become a niche technology in the next decade or so... and I give them that much time only because the HDD ecosystem has been so huge and has lasted so long.
    There's still a ton of HDD legacy encumbering SSDs: almost all software out there has been designed with the mindset that I/O is costly - it should be avoided, it should be linear, it should be in large chunks. Hardware? I'm sure I'm missing a lot, but I see signs of inefficiencies here too:
    * SATA / SAS devices are always far from the CPU. FC too, though to a lesser extent.
    * They don't support ops smaller than 512 B. Fusion-io got a billion 64 B IOPS earlier this year with 8 servers and 64 SSDs.
    * There's no standard for storage other than SATA / SAS / FC; all PCI-E devices are proprietary, with limited software support.
    The hard disk ecosystem is still more mature. Consumer SSDs have been flaky to date. There are no good SSD benchmarks. There are few data recovery services.
    And hard disks still have huge mind share. People still think that SSDs are expensive, though their disks lie empty. Developers still optimise their code for HDDs more often than for SSDs. In Polish, the common word for HDD ('dysk') has become synonymous with 'storage device', and SSDs are often incorrectly called 'dysk SSD' or 'dysk flash'.
    All these things are changing in favour of SSDs. And the changes will only accelerate SSD adoption.
    Last edited by m^2; 5th January 2013 at 10:55.

  13. #13 Piotr Tarsa (Member)
    Quote Originally Posted by m^2 View Post
    In Polish, the common word for HDD ('dysk') has become synonymous with 'storage device', and SSDs are often incorrectly called 'dysk SSD' or 'dysk flash'.
    Well. "Drive" is generally used to describe something that moves, so the term "solid state drive" also seems incorrect for me.

    SSDs aren't all roses. Writes and reads are super fast, but erasing is slow and has to be done in relatively large sectors. And before writing we need to erase - but if we wrote to only half of an erased sector, we don't need to erase the sector again to write to the other half. Additionally, there is a rather low limit on the number of erase cycles, so sophisticated controllers are designed to spread the wear (see the toy model below).
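
    A toy model of these constraints (invented, just to make the erase/write asymmetry concrete; real controllers are far more complex):
    Code:
    #include <cstdio>
    #include <cstring>

    const int BLOCK = 16;  // toy erase-block size in bytes (real blocks: 128 KB and up)

    struct FlashBlock {
        unsigned char data[BLOCK];
        int erase_count = 0;

        // erase is block-wide and burns one of the limited erase cycles
        void erase() { memset(data, 0xFF, BLOCK); ++erase_count; }

        // a write can only flip bits 1 -> 0, so it succeeds only on
        // still-erased (0xFF) bytes; otherwise the block needs an erase first
        bool write(int off, const unsigned char *src, int len) {
            for (int i = 0; i < len; ++i)
                if (data[off + i] != 0xFF) return false;
            memcpy(data + off, src, len);
            return true;
        }
    };

    int main() {
        FlashBlock b;
        b.erase();
        unsigned char half[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("write 1st half:   %s\n", b.write(0, half, 8) ? "ok" : "needs erase");
        printf("write 2nd half:   %s\n", b.write(8, half, 8) ? "ok" : "needs erase");  // no new erase needed
        printf("rewrite 1st half: %s\n", b.write(0, half, 8) ? "ok" : "needs erase");  // whole block must be erased again
        printf("erase cycles used: %d\n", b.erase_count);
        return 0;
    }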

    Quote Originally Posted by m^2 View Post
    Uses where price/GB is the dominant factor are quickly disappearing.
    I disagree. Camera and display resolutions are increasing, thus video and picture sizes are getting larger. Network speeds are also increasing, thus people download more. Programs are getting larger, games are getting larger, etc. And cheap disk space makes lossless formats an interesting option, especially those for sound data.
    Last edited by Piotr Tarsa; 3rd January 2013 at 22:23.

  14. #14 m^2 (Member)
    With help from a friend I got access to the paper. Not very inspiring; there are a lot of generalities and very few details.
    Among other things, there's no scientific benchmark. They do give numbers, but omit important details like how they got them.
    Anyway, they give hardware details: an NVIDIA GTX 580 was 20% faster than a Core i7-2600. Judging by the numbers, 20% faster than a single thread of the Core i7-2600. And there's no comparison of strength.
    So no, it doesn't seem to be a successful implementation.
