
Thread: what's wrong with writing a simple gpgpu packer? any working example? 8-)

  1. #1
    Member
    Join Date
    Jul 2014
    Location
    Mars
    Posts
    164
    Thanks
    115
    Thanked 10 Times in 9 Posts

what's wrong with writing a simple gpgpu packer? any working example? 8-)

all I've seen are SDKs and papers, papers without working binaries 8-)
I mean there is password-bruteforcing software and video encoding, but nothing file-compression related 8-(
    Last edited by necros; 25th September 2014 at 06:37.

  2. #2

  3. The Following 3 Users Say Thank You to Piotr Tarsa For This Useful Post:

    a902cd23 (25th September 2014),Bulat Ziganshin (12th May 2015),necros (4th June 2016)

  4. #3
    Member
    Join Date
    Jul 2014
    Location
    Mars
    Posts
    164
    Thanks
    115
    Thanked 10 Times in 9 Posts
    thanx 8-)

  5. #4
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
I might be wrong, but I think the serial structure of most compression makes it hard to fit into a massively multithreaded model.

Video encoding and brute forcing are easier because the data can be divided into multiple independent pieces to work on (hey thread 1, do the first 1024 brute-force tries; thread 2, do the next 1024; thread 1 gazillion, do your next gazillionth part of the tries).
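As a toy Python sketch (not from the post), the "hand each thread its own range" idea is just a partition of the search space into disjoint chunks that need no coordination:

```python
# Hypothetical sketch: splitting an embarrassingly parallel search space
# into fixed-size chunks, one per worker, as described above.
def partition(total, chunk):
    """Yield (start, end) ranges covering [0, total) in chunk-sized pieces."""
    for start in range(0, total, chunk):
        yield (start, min(start + chunk, total))

ranges = list(partition(10_000, 1024))
# worker i independently processes ranges[i]; no synchronization needed
```

Compression resists exactly this split because, unlike brute-force tries, the chunks are not independent: each depends on the model state built from earlier data.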

  6. #5
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
I know of only one good open-source project in the field of compression that uses CUDA, and that is BSC:

    http://libbsc.com/

gribok wrote:
    ---
    1. You need an NVIDIA GPU with compute capability 1.1 or higher.
    2. You need 270.81 drivers at least.
    3. You need Microsoft Visual C++ 2010 SP1 Redistributable Packages.
    4. Currently CUDA acceleration is implemented only for forward ST5, ST6, ST7 and ST8 transforms.

    Usage:
    1. GPU compression uses 17x-19x GPU memory for single block. So use smaller block size.
    2. GPU compression is designed for ultra fast compression,
    so preprocessing like segmentation or reordering need to be disabled.
    I recommend to use something like "-b4p -m7f" or "-b8p -m7f"
    ---

    status: The latest stable version is 3.1.0, released 8 July 2012

- in my tests, ST5 and ST6 give the best results

- ST5 and ST6 are implemented for both GPU and CPU

- ST4 is fast, but for now only implemented for the CPU

- it would be wonderful if we had a simplified source with only ST5 and/or ST6 that we could integrate into an archive format like 7-Zip or similar

- I think the performance of the ST5/ST6/ST7 implementation on CUDA would be even better if we could use CUDA compute capability 2.0 and the current CUDA version 6.5-14

another interesting project that uses parallel compression, though not GPU/CUDA, is plzip

    http://download.savannah.gnu.org/releases/lzip/plzip/

the current version 1.12 has no Windows binary for now

    best regards

  7. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
Unfortunately I've never managed to get the GPU part of libbsc to build; it normally fails with errors in the back40 code. It's also unclear how you're meant to build it, given the lack of documentation, but even this on the command line fails:

    Code:
    nvcc -arch sm_20 -g -DLIBBSC_SORT_TRANSFORM_SUPPORT -DLIBBSC_CUDA_SUPPORT -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -O3 -DNDEBUG -c libbsc/st/st.cu
    libbsc/st/b40c/radix_sort/../radix_sort/problem_instance.cuh(331): error: expected a ")"
              detected during:
                instantiation of "cudaError_t b40c::radix_sort::Enactor::IteratePasses<ProblemInstance, PROBLEM_SIZE, BITS_REMAINING, CURRENT_BIT, CURRENT_PASS>::DispatchPass<TUNE_ARCH>(ProblemInstance &, b40c::radix_sort::Enactor &) [with ProblemInstance=b40c::radix_sort::ProblemInstance<b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>, int>, PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, CURRENT_PASS=0, TUNE_ARCH=200]"
    libbsc/st/b40c/radix_sort/enactor.cuh(246): here
                instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort<PROBLEM_SIZE,BITS_REMAINING,CURRENT_BIT,DoubleBuffer>(DoubleBuffer &, int, cudaStream_t, int, bool) [with PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>]"
    libbsc/st/st.cu(237): here
    
    libbsc/st/b40c/radix_sort/../radix_sort/problem_instance.cuh(331): warning: variable "spine_elements" is used before its value is set
              detected during:
                instantiation of "cudaError_t b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DispatchPass(int, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::UpsweepKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::SpineKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DownsweepKernelProps &, bool, b40c::radix_sort::DynamicSmemConfig, int) [with DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>, _SizeT=int]"
    libbsc/st/b40c/radix_sort/enactor.cuh(145): here
                instantiation of "cudaError_t b40c::radix_sort::Enactor::IteratePasses<ProblemInstance, PROBLEM_SIZE, BITS_REMAINING, CURRENT_BIT, CURRENT_PASS>::DispatchPass<TUNE_ARCH>(ProblemInstance &, b40c::radix_sort::Enactor &) [with ProblemInstance=b40c::radix_sort::ProblemInstance<b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>, int>, PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, CURRENT_PASS=0, TUNE_ARCH=200]"
    libbsc/st/b40c/radix_sort/enactor.cuh(246): here
                instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort<PROBLEM_SIZE,BITS_REMAINING,CURRENT_BIT,DoubleBuffer>(DoubleBuffer &, int, cudaStream_t, int, bool) [with PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=40, CURRENT_BIT=16, DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, b40c::util::NullType>]"
    libbsc/st/st.cu(237): here
    
    libbsc/st/b40c/radix_sort/../radix_sort/problem_instance.cuh(331): warning: variable "spine_elements" is used before its value is set
              detected during:
                instantiation of "cudaError_t b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DispatchPass(int, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::UpsweepKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::SpineKernelProps &, const b40c::radix_sort::ProblemInstance<DoubleBuffer, _SizeT>::DownsweepKernelProps &, bool, b40c::radix_sort::DynamicSmemConfig, int) [with DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, unsigned char>, _SizeT=int]"
    libbsc/st/b40c/radix_sort/enactor.cuh(145): here
                instantiation of "cudaError_t b40c::radix_sort::Enactor::IteratePasses<ProblemInstance, PROBLEM_SIZE, BITS_REMAINING, CURRENT_BIT, CURRENT_PASS>::DispatchPass<TUNE_ARCH>(ProblemInstance &, b40c::radix_sort::Enactor &) [with ProblemInstance=b40c::radix_sort::ProblemInstance<b40c::util::DoubleBuffer<unsigned long long, unsigned char>, int>, PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=64, CURRENT_BIT=0, CURRENT_PASS=0, TUNE_ARCH=200]"
    libbsc/st/b40c/radix_sort/enactor.cuh(246): here
                instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort<PROBLEM_SIZE,BITS_REMAINING,CURRENT_BIT,DoubleBuffer>(DoubleBuffer &, int, cudaStream_t, int, bool) [with PROBLEM_SIZE=b40c::radix_sort::LARGE_PROBLEM, BITS_REMAINING=64, CURRENT_BIT=0, DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, unsigned char>]"
    libbsc/st/b40c/radix_sort/enactor.cuh(289): here
                instantiation of "cudaError_t b40c::radix_sort::Enactor::Sort(DoubleBuffer &, int, cudaStream_t, int, bool) [with DoubleBuffer=b40c::util::DoubleBuffer<unsigned long long, unsigned char>]"
    libbsc/st/st.cu(315): here
    
    1 error detected in the compilation of "/tmp/tmpxft_00000ccc_00000000-4_st.cpp1.ii".

  8. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
gribok wrote:
    ---
    3. You need Microsoft Visual C++ 2010 SP1 Redistributable Packages.


  9. #8
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    I was building under Linux. For windows I could just use the precompiled binaries anyway.

  10. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
it looks like there is no Linux CUDA support. at least the author never thought about it

  11. #10
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    437
    Thanks
    1
    Thanked 96 Times in 57 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    it looks like there is no linux cuda support. at least author never thought about it
There actually is, and it is working. However, I personally prefer OpenCL for that. GPU algorithms only work well if they are massively parallel, and here the actual problem comes into play: data compression, especially data modelling, is an inherently serial process - you need to adapt your dictionary and take history into account. Of course, one can disregard such knowledge and split the data into independent blocks. However, that also means you pay a price in compression performance (you lose the co-information between the data, i.e. where you split it off).

Data synchronization within a GPU only works within the same compute unit (within the same work-group, in the language of OpenCL); there are no generic synchronization primitives.

Of course, if you look into the works of Ana, specifically the parallel Huffman, you suffer from a two-pass approach: you first need to store bit offsets along with a static(!) Huffman output, then a second pass plus atomic operations inserts the data into the right spots. I haven't experimented with specifically *this* approach, but I tried a quite similar one (static Huffman) for image compression, and the serial "fit-together" part was clearly the dominating (aka "slow") part of the algorithm.

Data compression means, almost by definition of the problem, that you either compress fixed-length blocks to variable-length data (Huffman) or variable-length blocks to fixed-length data (Tunstall codes), both of which require some kind of synchronization between the blocks.
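The fixed-to-variable problem can be seen in a few lines. This is an illustrative Python sketch (the code lengths are assumed, not a real Huffman table): encoding equal-sized input blocks yields data-dependent output sizes, so each block's output position is only known after a prefix sum over all earlier blocks.

```python
# Assumed (hypothetical) code lengths in bits per symbol; they satisfy the
# Kraft inequality, so a real prefix code with these lengths exists.
CODE_LEN = {'a': 1, 'b': 2, 'c': 3, 'd': 3}

def block_bits(block):
    """Compressed size of one block in bits under CODE_LEN."""
    return sum(CODE_LEN[s] for s in block)

blocks = ["aaaa", "abcd", "dddd"]            # fixed-length input blocks
sizes = [block_bits(b) for b in blocks]      # variable-length outputs
offsets = [sum(sizes[:i]) for i in range(len(sizes))]  # exclusive prefix sum
```

The prefix sum is exactly the cross-block synchronization step the paragraph above refers to.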

The other option is lossy compression, where you could realize fixed-length to fixed-length codes (lossy, of course), but there you again have a coding loss due to the lack of proper rate distribution (this argument can also be made more precise, in the sense of fixed-length and variable-length quantization; estimates on the loss are available). In practice it means that it simply doesn't perform well.

Or - and that is the last option - you parallelize only a certain part of the codec, e.g. the wavelet or DCT, possibly Huffman per block. Whenever I tried that, the overhead of moving the data between GPU and CPU was the dominating factor, not the compression itself.

    Either way, there is no silver bullet.

  12. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
Data compression, especially data modelling, is an inherently serial process - you need to adapt your dictionary and take history into account. Of course, one can disregard such knowledge and split the data into independent blocks. However, that also means you pay a price in compression performance (you lose the co-information between the data, i.e. where you split it off).
I disagree. Take, for example, LZ77-based codecs. Matt Mahoney is already using block sorting to find matches, and block sorting is already done on the GPGPU in libbsc. What's missing is fitting it all together (which is a tough task, but I don't see any blocker).
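To illustrate the "block sorting to find matches" idea, here is a hedged toy Python sketch (my own, not from any of the codecs mentioned): sort all suffixes, then for each position check only its lexicographic neighbours for an earlier suffix and take the longest common prefix as the match. Real match finders refine this heavily.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def best_match(text, pos):
    """Longest LZ77-style match for text[pos:] at an earlier position,
    found via a sorted suffix list (O(n^2 log n) toy version)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    rank = sa.index(pos)
    best = (0, None)                      # (match length, source position)
    for nb in (rank - 1, rank + 1):       # lexicographic neighbours only
        if 0 <= nb < len(sa) and sa[nb] < pos:
            l = lcp(text[pos:], text[sa[nb]:])
            if l > best[0]:
                best = (l, sa[nb])
    return best

length, src = best_match("abcabcabc", 3)  # overlapping match back to pos 0
```

The suffix sort is the massively parallel part (what libbsc already does on the GPU); the neighbour lookup per position is independent per thread.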

Data synchronization within a GPU only works within the same compute unit (within the same work-group, in the language of OpenCL); there are no generic synchronization primitives.
There are global atomics, so you can synchronize data flow. For synchronization of control flow, you're left with synchronization within command queues. I don't know what OpenCL 2.0 brings in terms of synchronization facilities, but it will probably allow a much greater set of problems to be solved efficiently.

Of course, if you look into the works of Ana, specifically the parallel Huffman, you suffer from a two-pass approach: you first need to store bit offsets along with a static(!) Huffman output, then a second pass plus atomic operations inserts the data into the right spots. I haven't experimented with specifically *this* approach, but I tried a quite similar one (static Huffman) for image compression, and the serial "fit-together" part was clearly the dominating (aka "slow") part of the algorithm.

Data compression means, almost by definition of the problem, that you either compress fixed-length blocks to variable-length data (Huffman) or variable-length blocks to fixed-length data (Tunstall codes), both of which require some kind of synchronization between the blocks.
I haven't studied Ana's paper really well, but for Huffman coding I think we can use a scheme which is efficient in terms of memory bandwidth usage.

We can:
- split the uncompressed input into fixed-size blocks,
- compute a histogram over the whole input - there are many efficient solutions for that; it can basically run at the speed of VRAM,
- compute the Huffman tree - I can't predict how long it would take, but if we're compressing hundreds of megabytes it shouldn't matter much,
- compute the compressed sizes of the blocks and do a prefix sum over them to get their offsets - this will again require scanning the whole input,
- compress the data - again this can be done at the speed of VRAM, but now we have not only to read but also to write.

    Consequently, it seems possible to do Huffman coding at 1/4 of the speed of VRAM.

Concurrent writes to the same buffer require synchronization, but concurrent reads don't. We can implement Huffman coding where each thread writes to a non-overlapping region of memory. To achieve this we need the offsets of the compressed blocks (I've included them in the scheme above) and we must resolve the problem of Huffman codes crossing block boundaries with additional loops that hoist a few symbols from the previous block. This way we don't need to synchronize writes, but we need additional loops for special handling of the beginnings of blocks. We also need to be careful not to write into the next block while encoding the last symbol.
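The whole scheme can be simulated serially in a few lines. This is a toy Python sketch (my own, not from the thread; bit-level packing and the block-boundary handling are omitted), just to show that the computed offsets make the write regions disjoint:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_lengths(freq):
    """Code length in bits per symbol, from a heapq-built Huffman tree."""
    tie = count()                      # tie-breaker so dicts never compare
    heap = [(f, next(tie), {s: 0}) for s, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                 # degenerate single-symbol input
        return {s: 1 for s in freq}
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}  # deepen subtree
        heapq.heappush(heap, (fa + fb, next(tie), merged))
    return heap[0][2]

data = "abracadabra" * 8
BLOCK = 16
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

hist = Counter(data)                             # global histogram step
lengths = huffman_lengths(hist)                  # Huffman tree step
sizes = [sum(lengths[s] for s in b) for b in blocks]    # bits per block
offsets = [sum(sizes[:i]) for i in range(len(sizes))]   # prefix-sum step
# each "thread" can now write block i starting at bit offsets[i]; the
# regions are disjoint, so no write synchronization is needed
```

On a GPU the histogram, prefix sum, and per-block encode steps would each be a parallel kernel; only the (cheap) tree construction is serial.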

    Update:
Computing the block offsets can be done faster. While computing the global histogram, we can also store a histogram per block (this would require bigger blocks to yield a gain), and then compute the block offsets from the histograms and code lengths alone, reducing the memory bandwidth requirement of that step. That would bring the total speed closer to 1/3 of VRAM speed.
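In a short Python sketch (assumed code lengths, my own illustration): with a per-block histogram, each block's compressed size follows from the histogram and the code lengths alone, with no second scan of the input.

```python
from collections import Counter

lengths = {'a': 1, 'b': 2, 'c': 2}   # assumed code lengths (Kraft-valid)
blocks = ["aab", "ccc", "abc"]
hists = [Counter(b) for b in blocks]  # kept while building the global hist
# size of block = sum over symbols of count * code length; no input rescan
sizes = [sum(n * lengths[s] for s, n in h.items()) for h in hists]
offsets = [sum(sizes[:i]) for i in range(len(hists))]  # exclusive prefix sum
```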

The other option is lossy compression, where you could realize fixed-length to fixed-length codes (lossy, of course), but there you again have a coding loss due to the lack of proper rate distribution (this argument can also be made more precise, in the sense of fixed-length and variable-length quantization; estimates on the loss are available). In practice it means that it simply doesn't perform well.
I can think of lossy texture compression on the GPGPU which does multiple passes to select the best local method in every pixel block. Probably that's already done for the DXTC compression schemes.

nVidia's Maxwell architecture achieves great efficiency with narrow memory interfaces thanks to novel techniques for lossless compression of image buffers. So various forms of image compression on GPGPUs are doable.

Or - and that is the last option - you parallelize only a certain part of the codec, e.g. the wavelet or DCT, possibly Huffman per block. Whenever I tried that, the overhead of moving the data between GPU and CPU was the dominating factor, not the compression itself.
You can interleave transfer and computation by dividing the work into groups and processing them asynchronously. This way you have computation and transfers in both directions happening at the same time. PCI-Express is full duplex, meaning the uplink and downlink are independent and can be saturated simultaneously.
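A hedged sketch of the interleaving (Python threads stand in for CUDA streams or OpenCL command queues; `upload` and `compute` are hypothetical stand-ins for the DMA transfer and the kernel): while chunk i is being processed, chunk i+1 is already being uploaded.

```python
from concurrent.futures import ThreadPoolExecutor

def upload(chunk):      # stand-in for the host->device transfer
    return chunk

def compute(chunk):     # stand-in for the GPU kernel
    return [x * 2 for x in chunk]

chunks = [[1, 2], [3, 4], [5, 6]]
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    pending = pool.submit(upload, chunks[0])          # prime the pipeline
    for nxt in chunks[1:] + [None]:
        staged = pending.result()                     # wait for upload i
        next_upload = pool.submit(upload, nxt) if nxt else None
        results.append(compute(staged))               # overlaps upload i+1
        pending = next_upload
```

With a third stage for the device->host download, both PCI-Express directions and the compute units stay busy at once, as the post describes.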

There's also the HSA standard, and devices where the CPU and GPGPU fully share global memory.
    Last edited by Piotr Tarsa; 3rd October 2014 at 20:43.

  13. #12
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    it looks like there is no linux cuda support. at least author never thought about it
Linux itself has CUDA support and NVidia releases the "nvcc" compiler for it. It just seems that for whatever reason it doesn't like the back40 component, giving a very cryptic missing-bracket error (which isn't obviously true, but with templates, who knows!). The latest back40 code has changed radically and suffers from other build problems; besides, the back40computing project itself has abandoned it and switched to a newer library (http://nvlabs.github.io/cub/).

Anyway, this thread isn't really about BSC, but about practical code examples of doing data compression on a GPU. My post was simply to refute the claim that BSC is suitable as "a simple gpgpu packer". We still need simple examples.

Probably the best starting point would be the general libraries ("cub" above), or, for the more advanced, pulling apart the nvBWT component of nvBio (http://nvlabs.github.io/nvbio/nvbwt_page.html). The latter is specifically designed around BWT of DNA sequences for alignment purposes (e.g. nvBowtie), but the block-sorting components should be transferable.

