Results 1 to 14 of 14

Thread: OpenCL ST5 implementation

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts

    OpenCL ST5 implementation

    Hi,

    (I hope) I've implemented ST5 in OpenCL. As of now, CKE is not exploited (as I've written about here: http://encode.ru/threads/1274-Concur...mplementations ). I am not sure if I implemented it correctly.

    I'll be glad if you point me to some paper describing ST5 in detail or write algorithm yourself. If you know a good toolchain that does various stagers of BWT based compression post it here.

    Newest version of program is here: http://www63.zippyshare.com/v/93264665/file.html

    Usage:
    If you provide it with one parameter then it generates random data, computes ST5 and writes output to file. It will treat that parameter as output file name.
    If you provide it with two parameters, then first will be input file name and second will be output file name.
    In either case algorithm processes exactly 16 MiB of data, so you will have best results if you provide it a 16 MiB file.

    As for now it's slow as it's based on Bitonic Sort. It's not well suited to STx transforms as Bitonic Sort is unstable, so I had to append original index to each key, ie key consists of 5 bytes of input buffer + 3 bytes of input buffer position. Appending input position to key made Bitonic Sort stable. I'm planning to extend my program to compute full BWT transform, where stability of sorting is not required.

    I hope algorithm is good and provides reasonable results
    Last edited by Piotr Tarsa; 19th April 2011 at 23:37.

  2. #2
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    Sorry, but it's not working on my computer (java version "1.6.0_24"):

    >java -Djava.library.path=./natives/win32/ -jar dist/StreamPacker.jar ENWIK16m ENWIK16m.st5
    Exception in thread "main" java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Unknown Source)
    at java.nio.DirectByteBuffer.<init>(Unknown Source)
    at java.nio.ByteBuffer.allocateDirect(Unknown Source)
    at com.jogamp.common.nio.Buffers.newDirectByteBuffer( Buffers.java:69)
    at com.jogamp.common.nio.Buffers.newDirectLongBuffer( Buffers.java:153)
    at streampacker.Main.main(Main.java:4

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    How much memory do you have? Currently StreamPacker requires about 300 M on host side (System RAM) and about 256 M on device side (Video RAM). I'll try to reduce memory requirements later. You could also play with some switches to JVM (more precisely to HotSpot) like -Xmx500m or so.

  4. #4
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    2 GB RAM, 1 GB gfx, Windows 7 32-bit

  5. #5
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    It works with -Xmx500m, but I've tried to process 16MB from ENWIK8 and output is different than from bsc or CUDA_STx

  6. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Probably, you have set too small limits to JVM memory usage (or default limits on 32-bit version are small). Anyway, I have reduced memory requirements and uploaded updated version.

    Before Easter I hope to implement:
    - generating (context, pointer) tuples on GPU side,
    - caching compiled kernels, as now they're recompiled on every run,
    - correct algorithms if the output is wrong,

    I haven't written that already: output file consists of 16 MiB of transformed data + 3 bytes of rotation index.

    PS:
    16 MiB is exacly 16777216 bytes.

    If you could provide me a paper describing Schindler Transform in details I'll be able to correct algorithm.

    And unfortuately BSC doesn't have option to disable encoding, ie. doing only tranformation.
    Last edited by Piotr Tarsa; 18th April 2011 at 16:50.

  7. #7
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    Attached is an original paper about STx and a program which generates ST5-transformed data using bsc_st5_encode().
    Attached Files Attached Files

  8. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I have updated the program. Now it omits rotation index. I've corrected a bug with sign bits (poor Java doesn't recognize unsigned types) and now output is identical to BSC.

  9. #9
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    Currently the output from 16MB of ENWIK8 is only slightly different than from bsc (it can be related to assigning beginning numbers).

    In your logs:
    Total sorting time and transferring: 2403

    But measured with timer.exe:
    Kernel Time = 0.000 = 0%
    User Time = 0.031 = 0%
    Process Time = 0.031 = 0%
    Global Time = 5.158 = 100%

  10. #10
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    About two seconds or so are consumed by kernel compilation.

    Weird that output is different. On my system they are identical, and their md5 sum is: bbc154d58fe4ed8525f9adf19ceeb085

  11. #11
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    On my computer:
    7742e93ac6135c3189b312ec57fcd589 *ENWIK16m.st5
    bbc154d58fe4ed8525f9adf19ceeb085 *ENWIK16m.st5_bsc

  12. #12
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I have updated the program again. I have managed to do jobs from my to do list - now program caches OpenCL program binaries (in %tmpdir%/.StreamPacker/kernelsCache) and generates tuples and retrieves transformed data on GPU side. I hope that now the results will be consistent across systems.

    Unfortunately I haven't managed yet to force compiler to use L1 and L2 caches without breaking the program (ie with some switches it uses those caches but then result is incorrect).

  13. #13
    Programmer
    Join Date
    May 2008
    Location
    PL
    Posts
    307
    Thanks
    68
    Thanked 166 Times in 63 Posts
    The output is different, but still not the same with BSC:
    3463a11bfb8e6df75e488d8f12f7d012 *ENWIK16m.st5

  14. #14
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I have just checked on Windows 7 64-bit and everything is fine. Generally I'm pretty sure that my code is valid - both Java and OpenCL compilers have strict rules to conform so results should be repeatable. I think it's either bug in nVidia drivers or JogAmp's bindings. Alas, I don't have GeForce card so I'm unable to verify that. Anyway, I'll move to implementing BWT on GPU, as STx tranforms perform decently on CUDA cards and it's pretty unlikely that I'll achieve such performance in OpenCL on Radeon cards, ie I'm talking only about Radix Sorting.

Similar Threads

  1. Concurrent kernel execution in OpenCL implementations
    By Piotr Tarsa in forum The Off-Topic Lounge
    Replies: 0
    Last Post: 16th April 2011, 16:04
  2. I'm looking for the best free implementation of deflate
    By caveman in forum Data Compression
    Replies: 2
    Last Post: 22nd November 2010, 09:27
  3. Basic BWT implementation pointer
    By GerryB in forum Data Compression
    Replies: 10
    Last Post: 30th October 2010, 02:08
  4. Move-to-Front Implementation
    By Cyan in forum Data Compression
    Replies: 34
    Last Post: 8th August 2010, 01:11
  5. Statistical implementation of Ziv-Lempel
    By thomas in forum Data Compression
    Replies: 3
    Last Post: 10th February 2009, 20:13

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •