
Thread: Optimizing BWT decoding

  1. #1
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts

    Optimizing BWT decoding

Hi guys, I have a decoding loop for BWT and was wondering if there's a way to vectorize it across N parallel units. Currently it decodes 8 symbols at a time.
    https://github.com/loxxous/Jampack

I was wondering if there's a way to break the loop's dependencies so that it can decode N streams independently and simultaneously until each has processed 'step' symbols, instead of waiting on each iteration of the loop.

In my experiments, adding more streams improves speed up to a point: two decoding streams (indexes) usually give a 2x speedup over a single BWT index, but with 8 parallel streams it seems to cap out at a 3-4x speed increase. I doubt any compiler will notice that these are 8 non-overlapping streams, so I suspect it's not being optimized as well as it could be.

    Any advice is appreciated.

  2. The Following 4 Users Say Thank You to Lucas For This Useful Post:

    Bulat Ziganshin (22nd January 2017),Christoph Diegelmann (22nd January 2017),Mike (23rd January 2017),Shelwien (23rd January 2017)

  3. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
my greatest congratulations! it's not every day we see a 4x speedup in one of the key compression algorithms!!!

Now, why it's "only" 4x: the speed of ordinary unbwt is limited not by cpu cores, but by memory itself. On each step you have to fetch an almost random memory cell, so it can't be cached, and each decoded byte costs 1 memory access. And you need that data to arrive before you know the address of the next access. So you are limited by memory latency, which is usually 50-70 ns - hence the 15-20 MB/s decoding speed limit.
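To make the dependency chain concrete, here is a minimal sketch of the plain single-stream inversion loop, using the names that appear in the code later in this thread (MAP, BWT, T, idx); it's an illustration, not Jampack's actual code:

Code:
    uint32_t p = idx;
    for (uint32_t i = 0; i < n; i++) {
        p = MAP[p - 1];              /* the address depends on the previous load */
        T[i] = BWT[p - (p >= idx)];  /* emit one byte per memory round-trip      */
    }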

Your genius code traverses 8 indexes simultaneously, which means that in the same 50-70 ns you can decode 8 bytes, for an overall speed of maybe 120-160 MB/s. Well, only if memory has enough throughput - and that is limited too. My own measurements on a Sandy Bridge 2600K showed that with multiple simultaneous memory accesses you can read about 50-100 MB/s. I don't remember the exact numbers, but the speedup was probably in the range of 4-6x using a single core, and 6-8x using multiple cores.

This means you can get a bit more performance, maybe a 1.5x extra speedup, by splitting the work between multiple threads.

The limit has nothing in common with compiler optimizations. It's easy to see that you still perform only ~1 arithmetic operation per 10 cpu cycles (while up to 4 per cycle can be executed). So no - SIMD is absolutely useless here, since you utilize even the scalar ALUs at less than 10%.

Finally - modern CPUs execute instructions speculatively, limited much more by data dependencies than by original program order. In fact, there is a window of the last 100-200 not-yet-retired instructions, and these are executed as soon as their data arrive. This window is large enough to hold several iterations of your loop.

The only way i know to speed up unbwt beyond this point is to use lzp/dictionary preprocessors. In particular, we can encode words as 16-bit entities and perform a 16-bit BWT, so each memory access reads 2 useful bytes and gives us an entire word number or LZP length.

  4. The Following 2 Users Say Thank You to Bulat Ziganshin For This Useful Post:

    Lucas (23rd January 2017),Shelwien (23rd January 2017)

  5. #3
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
First off: good job!
I've already done some research in this direction (for my compressor bce, though currently I'm focusing on improving libdivsufsort and bce itself) and I think I've seen this before somewhere. I've got a folder full of papers on the inverse bwt, but sadly I can't find it at the moment. The papers "Cache Friendly Burrows-Wheeler Inversion" and "Slashing the Time for BWT Inversion" might be a good start. I think they (or another paper) achieved a 2x speedup by decoding 2 chars at a time, calculating MAP[MAP[p[0] - 1] - 1] ahead of time in a more linear way. This might be possible for your version, too.
Keep us updated!

  6. The Following 2 Users Say Thank You to Christoph Diegelmann For This Useful Post:

    Bulat Ziganshin (22nd January 2017),Lucas (22nd January 2017)

  7. #4
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
Thank you Bulat and Christoph! On my desktop (i5-4690K@4.4ghz) its decoding speed is about 115 MB/s on enwik8 with a 16MB block using all threads, so it's not far off from your observation; but that's without the entropy stage, since it's not the fastest ans coder in the world. I'll keep working on it.
    Last edited by Lucas; 23rd January 2017 at 06:49. Reason: spelling

  8. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
I'm guessing the main speed benefit here is that you have interleaved the code such that you don't need the result of p[0] = MAP[p[0] - 1] until 7 other similar instructions have executed, removing waits on memory fetches. Nice work.

Do you get a benefit from splitting the next assignment step in two as well? (I.e. read from BWT into a temporary local array, then copy that into T in a second step.) Even prefetches may not help if it is ultimately throughput limited, but you can try tools like perf or iaca to see how much waiting is going on; likely little by now.

The only way I can think of to beat a memory throughput limit is to get more memory accesses hitting cache. With BWT that's a real challenge, and I don't see how it could work short of some preprocessing step.

  9. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
I also wondered if it's possible to avoid so many scatter-style operations by buffering up data before distributing it.
I've had success with this before, but it didn't seem to help here. E.g.


byte X[64];
int i;
for (i = 0; i != (step & ~7); i += 8) { // bulk part: 8 symbols per stream per pass
    for (int j = 0; j < 8; j++) {       // one symbol from each of the 8 streams
        p[0] = MAP[p[0] - 1];
        p[1] = MAP[p[1] - 1];
        p[2] = MAP[p[2] - 1];
        p[3] = MAP[p[3] - 1];
        p[4] = MAP[p[4] - 1];
        p[5] = MAP[p[5] - 1];
        p[6] = MAP[p[6] - 1];
        p[7] = MAP[p[7] - 1];
        X[j + 0]  = BWT[p[0] - (p[0] >= idx)];
        X[j + 8]  = BWT[p[1] - (p[1] >= idx)];
        X[j + 16] = BWT[p[2] - (p[2] >= idx)];
        X[j + 24] = BWT[p[3] - (p[3] >= idx)];
        X[j + 32] = BWT[p[4] - (p[4] >= idx)];
        X[j + 40] = BWT[p[5] - (p[5] >= idx)];
        X[j + 48] = BWT[p[6] - (p[6] >= idx)];
        X[j + 56] = BWT[p[7] - (p[7] >= idx)];
    }
    for (int j = 0; j < 8; j++)         // flush each stream's 8 buffered bytes
        memcpy(&T[i + offset[j]], &X[j * 8], 8);
    //  *(uint64_t *)&T[i + offset[j]] = *(uint64_t *)&X[j * 8];
}
for (; i != step; i++) {                // remainder: one symbol per stream per pass
    p[0] = MAP[p[0] - 1];
    p[1] = MAP[p[1] - 1];
    p[2] = MAP[p[2] - 1];
    p[3] = MAP[p[3] - 1];
    p[4] = MAP[p[4] - 1];
    p[5] = MAP[p[5] - 1];
    p[6] = MAP[p[6] - 1];
    p[7] = MAP[p[7] - 1];
    T[i + offset[0]] = BWT[p[0] - (p[0] >= idx)];
    T[i + offset[1]] = BWT[p[1] - (p[1] >= idx)];
    T[i + offset[2]] = BWT[p[2] - (p[2] >= idx)];
    T[i + offset[3]] = BWT[p[3] - (p[3] >= idx)];
    T[i + offset[4]] = BWT[p[4] - (p[4] >= idx)];
    T[i + offset[5]] = BWT[p[5] - (p[5] >= idx)];
    T[i + offset[6]] = BWT[p[6] - (p[6] >= idx)];
    T[i + offset[7]] = BWT[p[7] - (p[7] >= idx)];
}

  10. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Meh! How do I post code now? Has this forum changed? I couldn't find any way of doing raw quoting. Odd.

  11. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
It could be related to the https move. Anyway, bbcode still works.

    Code:
    test

    [code]test[/code]

  12. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    >Do you get a benefit from splitting the next assignment step in 2 too? (Ie read from BWT into a temporary local array and then copy that into T in a second step.)

Each of the 8 output segments is written sequentially, so it's already fine


PS: Yeah, i can't answer with quoting, and in Advanced answer mode there are no formatting buttons at all. Also, it's impossible to insert a newline. Hopefully all smilies are alive, so the forum kept its most important piece of functionality
    Last edited by Bulat Ziganshin; 23rd January 2017 at 21:06.

  13. #10
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by Shelwien View Post
    It could be related to https move. Anyway, bbcode still works.

    Code:
    test

    [code]test[/code]

    For single lines only. All lines get pasted together, even in a code block, breaking formatting.

    I guess this is an update gone awry and will fix itself with a new update soon. Such is life!

    Edit: no matter! I see it already fixed itself and was purely a display issue. Hurrah to the admins. Thank you.

  14. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    >Do you get a benefit from splitting the next assignment step in 2 too? (Ie read from BWT into a temporary local array and then copy that into T in a second step.) Each of 8 output segments is written sequentially, so it's already fine
Ah yes, I see what you mean. There are 8 disparate writes, but always marching up linearly, so it's basically just updating the same 8 cache lines most of the time, and splitting gains little. Looks like it is indeed just memory I/O bound. I tried some hinting with _mm_prefetch (NTA etc.), but got the usual outcome of it not helping. I've occasionally had luck with prefetch, but it seems very rare. It also only reduces latency and can't solve bandwidth issues, I think.
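For reference, a sketch of the kind of hinting meant here, as a hypothetical variant of one pass of the 8-stream loop from post #6 (update all indexes, hint the upcoming BWT lines, then emit the bytes); step8 is an invented name. The prefetch-to-use distance is tiny, which is one reason it buys nothing:

Code:
    #include <stdint.h>
    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

    static inline void step8(uint32_t *p, const uint32_t *MAP, const uint8_t *BWT,
                             uint8_t *T, const uint32_t *offset,
                             uint32_t idx, uint32_t i)
    {
        for (int s = 0; s < 8; s++)
            p[s] = MAP[p[s] - 1];                 /* 8 independent index updates */
        for (int s = 0; s < 8; s++)
            _mm_prefetch((const char *)&BWT[p[s]], _MM_HINT_NTA);
        for (int s = 0; s < 8; s++)
            T[i + offset[s]] = BWT[p[s] - (p[s] >= idx)];  /* emit 8 bytes */
    }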

  15. #12
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
Since it's mainly a memory throughput limitation, couldn't we achieve even higher speeds by moving the workload somewhere with faster memory, i.e. the GPU? GDDR5 bandwidth is anywhere from 4 to 10 times that of the DDR3 in most machines, so I imagine it's still possible to squeeze even more speed out of this algorithm.

  16. #13
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
I think the problem is not necessarily throughput. To me, ~100 MB/s sounds like CAS latency, which is around 6-10 ns (i.e. 100-150 million ops per second) on most modern DDR3/4. Sequential reading of memory would be much faster. But sadly that's the main problem with BWT: it's inherently non-sequential, which hurts SACAs, context-aware compression like bce (antidictionary coding, compression by substring enumeration), and the decoding side. I don't know if GPUs can break this limitation (I'm not a GPU coder).

The best way is probably to find a way to decode more than 1 byte per lookup. The paper I had in mind suggested something close to the following: iterate over the BWT and count each char. Now we know the intervals where the sorted suffixes beginning with each char start. So, given the current index, we can do a binary search to find the first char of the current suffix, and using the bwt we can find the char preceding it. Now, if we have calculated MAP[MAP[p[0] - 1] - 1] ahead of time (making this only 1 lookup => MAP2[p[0]]), we can write 2 chars per lookup. The paper had a clever way to calculate MAP2 that greatly reduced the number of cache misses. Decoding more than 2 bytes per lookup requires big tables to get the chars and therefore didn't result in a speedup.
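A minimal sketch of the fused-step idea under simplified 0-based indexing (the paper builds MAP2 in a much more cache-friendly way; MAP2, PAIR, S and the function names are illustrative, and the primary-index correction plus odd-length tails are omitted):

Code:
    #include <stdint.h>

    /* S[k] = the byte the plain loop would emit when visiting index k,
       MAP = the usual LF step. The two random reads per position below are
       independent across iterations, so the CPU can overlap them - unlike
       the serial chain in the decode loop. */
    void build_map2(const uint32_t *MAP, const uint8_t *S,
                    uint32_t *MAP2, uint16_t *PAIR, uint32_t n)
    {
        for (uint32_t k = 0; k < n; k++) {
            uint32_t q = MAP[k];                     /* one LF step           */
            MAP2[k] = MAP[q];                        /* two LF steps, fused   */
            PAIR[k] = S[k] | ((uint16_t)S[q] << 8);  /* the two decoded chars */
        }
    }

    void unbwt2(const uint32_t *MAP2, const uint16_t *PAIR,
                uint32_t start, uint8_t *T, uint32_t n)   /* n assumed even */
    {
        uint32_t p = start;
        for (uint32_t i = 0; i < n; i += 2) {
            uint16_t w = PAIR[p];            /* one random access, two bytes */
            T[i]     = (uint8_t)w;
            T[i + 1] = (uint8_t)(w >> 8);
            p = MAP2[p];
        }
    }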

    I hope this helps
    Last edited by Bulat Ziganshin; 4th April 2017 at 22:58.

  17. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
first, a few style suggestions (see the sketch below):
1. you can replace the internal manually-unrolled loop with an explicit for statement. compilers should be good at unrolling it, and anyway the speed is limited by memory
2. i suggest giving the bwt index type a name such as Idx and replacing "int" usages with that type name. this will let the algorithms you develop work with uint/int64 indexes in the future. as program size increases, the cost of such changes grows exponentially
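An illustrative sketch of both points (Idx and STREAMS are names invented here, not Jampack's):

Code:
    typedef uint32_t Idx;    /* swap for uint64_t when blocks outgrow 4 GB */

    enum { STREAMS = 8 };

    for (int s = 0; s < STREAMS; s++) {   /* compiler can unroll this */
        p[s] = MAP[p[s] - 1];
        T[i + offset[s]] = BWT[p[s] - (p[s] >= idx)];
    }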

Quote Originally Posted by Lucas View Post
On my desktop (i5-4690K@4.4ghz) its decoding speed is about 115 MB/s on enwik8 with a 16MB block using all threads, so it's not far off from your observation; but that's without the entropy stage, since it's not the fastest ans coder in the world.
4.4*4*4 ~= 70, so your cpu can perform about 70 GIOPS, i.e. ~500 operations per byte decoded at that rate. unbwt by itself uses only ~10 operations, so you're left with ~490 operations/byte for the entropy decoder

    p[0] = MAP[p[0] - 1];
    T[i + offset[0]] = BWT[p[0] - (p[0] >= idx)];
there is a well-known optimization of storing the BWT[] and MAP[] values together, since they are indexed by (almost) the same index. Your code would read the 4 bytes of the MAP[k-1] value and then either the preceding byte BWT[k-1] or the succeeding byte BWT[k] - in 90% of cases they will be in the same cache line
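A minimal sketch of one such packed layout, assuming blocks up to 2^24 bytes (e.g. the 16MB blocks mentioned above) so the index fits in 24 bits; this is essentially the combined array bzip2 uses for its inversion, and NXT/origPtr are illustrative names:

Code:
    #include <stdint.h>

    /* NXT[k] = (lf(k) << 8) | byte(k), built in one sequential pass;
       each step of the inversion then costs exactly one random access. */
    void unbwt_packed(const uint32_t *NXT, uint32_t origPtr,
                      uint8_t *T, uint32_t n)
    {
        uint32_t v = NXT[origPtr];
        for (uint32_t i = 0; i < n; i++) {
            T[i] = (uint8_t)v;    /* low 8 bits: the decoded byte */
            v = NXT[v >> 8];      /* high 24 bits: the next index */
        }
    }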

  18. The Following 3 Users Say Thank You to Bulat Ziganshin For This Useful Post:

    JamesB (25th January 2017),Lucas (25th January 2017),Shelwien (25th January 2017)

  19. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
afaik, Maxwell GPUs can perform about 1 random memory access per clock cycle, and this probably remains true for Pascal. This means that 1-2 GB/s can be decoded

The ALU intensity of the program is very low (10 or even 20 ALU operations per byte * 2 GB/s = 40 GIOPS, while GPUs are capable of several TIOPS), so it will be bound by memory hierarchy speed

    There are 4 main APIs:

    MS C++ AMP: available only in MSVC+windows, compatible with any GPU, has CPU fallback, easiest for programming, poor features
    OpenCL 1.2: available everywhere (win/linux/mac..), compatible with any GPU, hardest for programming, poor features
    OpenCL 2.x: available everywhere, compatible with AMD GPUs/APUs and Broadwell+ IGPs, hard for programming, average features
    CUDA: available everywhere, compatible with NVidia GPUs, average easiness of programming, richest features

Since the algorithm looks very simple and doesn't need any advanced features, i suggest implementing it with OpenCL 1.2 for maximum portability. You can use the Boost.Compute library to simplify its usage to roughly CUDA level



Afaik, the usual memory access delay on Maxwell is ~600 cycles, but it increases to ~2000 cycles for memory-intensive algorithms. With frequencies around 1 GHz this means a 2000 ns memory access delay, i.e. 500K accesses/second per thread. So, in order to reach 2 GB/s throughput with such delays, you need to perform 4000 accesses concurrently, i.e. run 4000 GPU threads

Initial indexes for these 4K threads should be stored among the BWTed data. This means that the index-collecting cycle should be replaced with a more efficient solution (see the sketch below) - either the condition "SA[i]%step==0" (supposing the compiler converts "%step" into multiplication by a constant computed before the loop), or "modf(SA[i]/double(step))<1e-12", or step can be rounded to a power of 2
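An illustrative sketch of the first variant, done on the encode side where the suffix array SA[] is still available (header_idx, n and step are assumed names):

Code:
    /* record the BWT row of every text position that is a multiple of 'step',
       giving the decoder n/step independent chains to chase in parallel */
    for (uint32_t i = 0; i < n; i++)
        if (SA[i] % step == 0)        /* % by loop-invariant step is cheap */
            header_idx[SA[i] / step] = i;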

  20. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Shelwien (25th January 2017)

  21. #16
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
Thanks for the advice Bulat! I'm more of a hobbyist programmer and appreciate the tips on proper coding practice. The code is still in its early stages and I plan to make an archiver out of it; we'll see how that goes. Right now I'm rewriting the entropy coder for faster and higher compression, and once that's done I'll move on to beefing up the inverse BWT.

  22. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Q. How to improve the speed of an algorithm which is slow because of random memory access?
    A. Make it access memory sequentially more frequently.

Q. What is sequential access to the symbols of a BWT-transformed file?
A. It's batch processing of the matches in the file.

Well, it's just an idea. What if we used hash lookups keyed by match prefixes to locate their most probable suffixes, or something?
Or batch-decode the match substrings into some LZ token sequence, and then unroll it with an LZ decoder (copy the matches)?

  23. #18
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
I've taken your advice and overhauled and refactored the code; it now supports 2048-way BWT decoding. I'm just working on an interleaved buffer like you suggested, and it'll probably be ready within the next week. Also, is there actually a way to utilize those unused CPU cycles in the BWT stage?

  24. #19
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
Here we go, a concurrent BWT transform. It's just the BWT phase, optimized and ripped out of the main Jampack project. It supports concurrency levels from 1 to 2048 on multiple threads and uses a fixed 16MB block, though I'll work on supporting variable block sizes. I was surprised that the interleaved buffer optimization didn't actually make it faster; it ran at the same speed as the standard non-cacheable version. In the meantime I'll be trying to get it to run on the GPU.

    fyi: a concurrency of 1 is equivalent to a standard BWT implementation.
Attached Files

  25. #20
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
Latest version up on github: added multi-threaded decoding with OoOE for a single block, and variable block size support from 1MB to 256MB. I would go higher but I don't have a working 64-bit compiler; mingw-w64's g++ can't seem to compile the project correctly :/ . I'm including a zip with the code and a windows binary to play around with; using a 256MB block and all threads I'm able to invert enwik9 in about 21 seconds on my machine.
While rewriting the decoder, it seems to have taken a 10% speed hit when using all threads on one block compared to the one-thread-per-block scheme. Giving each thread its own copy of the pointers into the BWT works better than sharing a list of pointers, but there still seems to be a dependency somewhere between the threads.
Attached Files

  26. The Following 3 Users Say Thank You to Lucas For This Useful Post:

    Bulat Ziganshin (6th March 2017),Christoph Diegelmann (4th March 2017),willvarfar (3rd March 2017)

  27. #21
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
btw, i found an old thread where we researched RAM access speed, which you may find useful

  28. #22
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
New version on github: added 1GB block support, made the rANS and MTF stages part of the same multi-threaded process, added order-1 information inheritance for the model, and it's now MIT licensed. Finally got to test a full block on enwik9: it compresses to 200MB and decodes in 29.5 seconds. Considering it's not using any dictionary pre-compression, I'm quite happy with the result. Still working on making the model stronger; it's kinda nice that this thread https://encode.ru/threads/2734-Musin...ci-coding-quot showed up to give me some ideas on how to better compress the BWT.

Another neat thing I noticed is that the entropy stages barely have any impact on process time: using the same 256MB block as in the previous test, it took only 1 second more to actually decompress from the packed form and invert than to just invert the BWT with no entropy stage. Clearly there's room for more complex models
    Last edited by Lucas; 22nd March 2017 at 07:37.

  29. The Following 2 Users Say Thank You to Lucas For This Useful Post:

    Bulat Ziganshin (1st April 2017),Christoph Diegelmann (22nd March 2017)

  30. #23
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    122
    Thanks
    36
    Thanked 33 Times in 24 Posts
    FYI, I had to add
    Code:
    typedef uint8_t byte;
    to model.h to compile x64 on Linux (Ubuntu).

  31. #24
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    122
    Thanks
    36
    Thanked 33 Times in 24 Posts
    Corruption when decoding with 8 threads:

    Code:
    time ./Jampack_x64 d enwik8.jpk enwik8.bak -t4
    Read: 23 MB => 95 MB (24.57%) @ 19.72 MB/s
    Completed in 4.82 seconds
    
    
    real    0m1.545s
    user    0m4.760s
    sys     0m0.072s
    
    
    time ./Jampack_x64 d enwik8.jpk enwik8.bak -t8
    
    
     Error: Detected corrupt block!
    
    
    real    0m1.130s
    user    0m5.504s
    sys     0m0.076s

  32. #25
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Quote Originally Posted by hexagone View Post
    Corruption when decoding with 8 threads:

    Code:
    time ./Jampack_x64 d enwik8.jpk enwik8.bak -t4
    Read: 23 MB => 95 MB (24.57%) @ 19.72 MB/s
    Completed in 4.82 seconds
    
    
    real    0m1.545s
    user    0m4.760s
    sys     0m0.072s
    
    
    time ./Jampack_x64 d enwik8.jpk enwik8.bak -t8
    
    
     Error: Detected corrupt block!
    
    
    real    0m1.130s
    user    0m5.504s
    sys     0m0.076s
That is something I've noticed; I'll try to fix it. Decoding should work on 7 threads though. I think it has something to do with how the workload gets shared.
Edit: Yep, it was. I just posted the fix; now it should work for any number of threads without failing. Let me know if you bump into any more issues.
    Last edited by Lucas; 27th March 2017 at 02:29.

  33. #26
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Quote Originally Posted by hexagone View Post
    FYI, I had to add
    Code:
    typedef uint8_t byte;
    to model.h to compile x64 on Linux (Ubuntu).
Whoops, yeah, I tried to remove all occurrences of that type to make it compile with fewer dependencies, but I guess I missed that one. Thanks for spotting it.

  34. #27
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    unbwt papers i found:
    http://citeseerx.ist.psu.edu/viewdoc...10.1.1.32.5149
    https://www.cs.helsinki.fi/u/tpkarkk...ns/ccp2011.pdf

Quote Originally Posted by Lucas View Post
Also is there actually a way to utilize those unused CPU cycles in the BWT stage?
the usual compression chain is bwt->mtf->coder, so at decompression it's decoder->unmtf->unbwt, and we can perform all stages on a single symbol before moving on to the next:

    Code:
for (i = 0; i < len; i++)
{
    rank   = DecodeRank();   /* entropy decoder                        */
    symbol = UnMTF(rank);    /* inverse MTF                            */
    /* put the decoded symbol in its proper place (random RAM write)   */
}
the last line of the loop requires ~40 cpu cycles due to RAM slowness, so you can use those 40 cycles to execute the first two lines. again, it's speculative execution, so decoder+unmtf will use the same cpu cycles as the RAM write in the last line. Moreover, since you can use 4 cores, you have ~160 cpu cycles overall to decode+unmtf a single symbol
    Last edited by Bulat Ziganshin; 4th April 2017 at 23:43.

  35. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Lucas (5th April 2017)

  36. #28
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
I decided to see where most of the time goes in my tests (again on enwik9), to judge whether integrating the rans+mtf stage with the bwt would be worth it. Interestingly, the I/O write is the second most demanding part of the algorithm. I'll test on an ssd when I can.
    Code:
    rAns+unmtf completed in 3.94 seconds
    Unbwt completed in 16.98 seconds
    UnLZ completed in 0.26 seconds
    Integrity check completed in 0.12 seconds
    Write completed in 7.75 seconds
    Read: 177 MB => 953 MB (18.62%) @ 32.72 MB/s
    Completed in 29.30 seconds
    I'll see if I can figure out a way to integrate the bwt with the entropy stage to save those few seconds.
Edit: I don't think that would work, since the inversion stage needs to know all symbol frequencies within the bwt before it can invert, so entropy decoding has to finish first. I could write out the raw symbol counts so it can decode on the fly, but that would add another 1KB of header overhead per block, and it's already writing 840 indexes into the header.
    Last edited by Lucas; 5th April 2017 at 04:22.

  37. #29
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
1. I/O isn't part of the compression algorithm. SSD write speeds are in the 30..3000 MB/s range, and you have 140 MB/s, which is very good for an HDD. usually i measure speeds by compressing or decompressing from file cache to nul

2. bsc uses lzp rather than lz77 for a reason - lz77 output contains poorly-compressible match offsets, while lzp output has only match lengths. compare their compression efficiency - i'm sure the bsc filter will outperform yours. it's also much faster

3. did you read my bwt thread? the SRC transformation is faster and more efficient than WFC, why don't you use it? and it can be combined with order-1 modeling

4. i think a few hundred bytes isn't too high a price for a 20% speedup; the amount of programming effort required matters more

5. your model seems not very efficient, but still unusual. it would be interesting to read a description of it, in particular the "inheritance" machinery. can you make an executable with just the old mtf + order-0 coder? i'm interested in what the effect of adding wfc was

for (Index i = 0; i < newLen; ++i)
    MAP[count[BWT[i]]++] = i + (i >= idx);

can you measure this cycle's time separately? i think it can be made 5x faster using the radix sort trick, and another 4x faster with m/t, since this cycle writes into 256 different memory pages simultaneously, so its speed is limited by L2 TLB throughput. OTOH, i measured on random data, which is very far from BWT output stats

            int len = ((in_p+StackSize) < *Input.size) ? StackSize : (*Input.size - in_p);

if in_p == (void*)-2 or so, this will overflow and fail. always use the "bufend - ptr < size" approach
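I.e., something like this sketch (assuming in_p is an offset into the input and *Input.size its total length):

Code:
    /* compute the remaining byte count first - this can't overflow,
       unlike in_p + StackSize */
    int remaining = *Input.size - in_p;
    int len = remaining < StackSize ? remaining : StackSize;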
    Last edited by Bulat Ziganshin; 5th April 2017 at 19:05.

  38. #30
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    i gathered extra stats for enwik9 encoding:
Code:
    time_t start = clock();
    // Gather statistics for the block
    int len = ((in_p + StackSize) < *Input.size) ? StackSize : (*Input.size - in_p);
    rank->encode(&Input.block[in_p], len);
    total[0] += clock() - start; start = clock();
    int rlen = len;
    rle0->encode(&Input.block[in_p], rlebuf, &rlen);
    total[1] += clock() - start; start = clock();
    stats->Build(rlebuf, rlen);
    total[2] += clock() - start; start = clock();
    stats->SetEncodeTable();
    total[3] += clock() - start; start = clock();
    int sptr = 0;
    for (; sptr < rlen; sptr++)
        stack[sptr] = stats->EncGetRange(rlebuf[sptr]);
    total[4] += clock() - start; start = clock();




    Code:
    C:\Jampack-master>timer Jampack_x64.exe c z:\e9 m:\c9 -t1 -b1000
    0: 10.84 seconds
    1: 2.17 seconds
    2: 0.44 seconds
    3: 0.02 seconds
    4: 1.16 seconds
    5: 1.38 seconds
    Read: 953 MB => 177 MB (18.62%) @ 5.83 MB/s
    Completed in 163.56 seconds

probably wfc occupies the lion's share of the decoding time too

btw, lz77 encoding with minmatch=64 even slightly decreases the enwik8/9 compression ratio. probably you added it only to improve speed
