
Thread: Rans tricks

  1. #1
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts

    Rans tricks

    I managed to optimize rANS to close to Finite State Entropy's speeds without using platform-specific intrinsics or assembly: https://github.com/loxxous/rans-tricks

    I doubt this is worth a forum post, but hey, there's no point keeping it to myself.

    I've included 3 exes: rans_std uses Fabian's original rans_byte header, rans_trick_8 uses 8-bit branchless renormalization (identical compression to rans_std, just faster), and rans_trick_16 uses 16-bit branchless renormalization, which is a little faster still since it takes a non-iterative approach.

    Usage example: "rans_std.exe D:\downloads\file.dat". It only benchmarks files up to 100 MB because I'm lazy; it's mainly just a proof of concept.

    In my tests, adding more than 4 states does not improve performance, since at that point it's limited by operation dependencies; the next step would be intrinsics, but that's rans_static's territory.

    Let me know what you think.
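    For anyone who doesn't want to dig through the repo, the non-iterative 16-bit idea can be sketched like this (my paraphrase, not the repo's exact code; RANS_L = 1 << 16 as the lower bound of the normalized interval, and one readable word of slack past the stream pointer, are assumptions):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RANS_L (1u << 16)  /* lower bound of the normalized interval [L, 2^32) */

    /* With word-sized renorm at most one load is ever needed per symbol,
     * so the renormalization loop disappears entirely. */
    static inline void RansDecRenorm16(uint32_t *r, const uint16_t **pptr)
    {
        uint32_t x = *r;
        uint32_t s = x < RANS_L;              /* 1 iff renorm is needed      */
        uint32_t t = (x << 16) | **pptr;      /* speculative read of a word  */
        *pptr += s;                           /* consume it only when needed */
        *r = s ? t : x;                       /* select; compilers may cmov  */
    }

    int main(void)
    {
        const uint16_t buf[] = { 0xBEEF, 0x1234 };
        const uint16_t *p = buf;

        uint32_t x = 0x1234;                  /* below L: pulls one word     */
        RansDecRenorm16(&x, &p);
        assert(x == 0x1234BEEFu && p == buf + 1);

        x = 0x80000000u;                      /* normalized: left untouched  */
        RansDecRenorm16(&x, &p);
        assert(x == 0x80000000u && p == buf + 1);

        puts("ok");
        return 0;
    }
    ```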
    Attached Files
    Last edited by Lucas; 27th April 2018 at 07:25. Reason: Updated trick_16 with branchless encoding

  2. The Following 6 Users Say Thank You to Lucas For This Useful Post:

    Bulat Ziganshin (15th May 2018),Cyan (27th April 2018),JamesB (27th April 2018),Jarek (30th April 2018),jibz (27th April 2018),Stephan Busch (27th April 2018)

  3. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    The code in RansEncPutSymbol is definitely a nice improvement, although a bit compiler-specific in terms of gains. I experimented with trying to get cmov instructions used to avoid the branching, but it's very hard to get compilers to emit them!

    With gcc 7.1 I found it improved encoding speed on poorly compressible data and hurt speed on highly compressible data, presumably depending on how well the branch predictor is doing. With clang 5.0 it was faster on both (though again much better on hard-to-compress data).

    RansEncRenorm looks to be doing 16-bit normalisation instead of 8-bit, which differs from Ryg's original rans_byte; I guess this is where much of the extra (initial) speed is coming from. Good gains. I ought to try this myself, as encode speed has always been my weakness. Food for thought!

    Edit: that system was pretty old. On a newer machine it doesn't lose time even on low-entropy data, and gains elsewhere. Nice trickery.
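    For the encode side, the same masked-select do-while seems applicable to renormalization too. A sketch (my own reconstruction in the spirit of the post, not Lucas' or the repo's actual code; the x_max parameter and the one byte of writable slack below ptr are assumptions):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Branchless 8-bit encoder renorm: spill bytes downward using a mask
     * instead of a branch. Requires one byte of writable slack below ptr
     * for the speculative store when no spill happens. */
    static inline void RansEncRenormBranchless(uint32_t *r, uint8_t **pptr,
                                               uint32_t x_max)
    {
        uint32_t x = *r;
        uint8_t *ptr = *pptr;
        uint32_t s = x > x_max;        /* 1 iff at least one byte spills */
        uint32_t tx = x;
        do {
            ptr[-1] = (uint8_t)tx;     /* speculative store into slack   */
            ptr -= s;                  /* commit only when spilling      */
            tx >>= 8;
        } while (tx > x_max);
        *pptr = ptr;
        *r = (0u - s & (tx - x)) + x;  /* take tx when s == 1            */
    }

    int main(void)
    {
        uint8_t buf[16];
        uint8_t *ptr = buf + sizeof(buf);
        uint32_t x;

        x = 0x80000000u;               /* needs two spill iterations     */
        RansEncRenormBranchless(&x, &ptr, 0x1FFFF);
        assert(x == 0x8000 && ptr == buf + 14);

        x = 0x10000u;                  /* already small enough: no spill */
        RansEncRenormBranchless(&x, &ptr, 0x1FFFF);
        assert(x == 0x10000u && ptr == buf + 14);

        return 0;
    }
    ```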
    Last edited by JamesB; 27th April 2018 at 16:49.

  4. #3
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    The decoder re-normalization I did for trick_8 was this:

    Code:
    uint8_t *ptr = *pptr;
    int32_t s = x < RANS_BYTE_L;     /* 1 iff renormalization is needed  */
    uint32_t tx = x;
    do
    {
        tx = (tx << 8) | *ptr;
        ptr += s;
    }
    while (tx < RANS_BYTE_L);
    *pptr = ptr;
    x = (-s & (tx - x)) + x;         /* take tx when s == 1, else keep x */

    It avoided the branch but still had to iterate; I only posted the source for the 16-bit version since it's the fastest.

  5. #4
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    7
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Lucas View Post
    x = (-s & (tx - x)) + x;
    I use the same trick in my M99 coder. I also found it faster than cmov.
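    For reference, the select identity behind that line is easy to check in isolation (a tiny self-contained verification, not code from either coder): for s in {0, 1}, (-s & (t - x)) + x equals s ? t : x, because -1 is an all-ones mask in two's complement and -0 is zero.

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        const uint32_t x = 100, t = 0xDEADBEEFu;
        for (uint32_t s = 0; s <= 1; s++) {
            /* mask is all-ones when s == 1, all-zeros when s == 0 */
            uint32_t r = ((0u - s) & (t - x)) + x;
            assert(r == (s ? t : x));
        }
        return 0;
    }
    ```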

  6. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Cmov can be faster, but it's hard to get the compiler to use it.

    Benchmarking the above code vs some assembly using cmov on two files that encode to 1.84 bits per byte and 4.54 bits per byte gives me these decode speeds (best of 10):

    asm cmov: 507.7 MB/s, 495.3 MB/s
    no-branch: 414.8 MB/s, 411.8 MB/s
    original ryg: 358.1 MB/s, 365.8 MB/s

    The assembly version is:

    Code:
    static inline void RansDecRenorm(RansState* r, uint8_t** pptr) {
        uint32_t x = *r;
        uint8_t  *ptr = *pptr;
        __asm__ ("movzbl (%0), %%eax\n\t"
                 "mov    %1, %%edx\n\t"
                 "shl    $0x8,%%edx\n\t"
                 "or     %%eax,%%edx\n\t"
                 "cmp    $0x800000,%1\n\t"
                 "cmovb  %%edx,%1\n\t"
                 "adc    $0x0,%0\n\t"
                 : "=r" (ptr), "=r" (x)
                 : "0" (ptr), "1" (x)
                 : "eax", "edx"
                 );
        if (x < 0x800000) x = (x << 8) | *ptr++;
        *pptr = ptr;
        *r = x;
    }
    It's only half-branchless (one cmov) because the second renorm step is rarely triggered and very predictable by the CPU. The 16-bit renorm version runs at around 660 MB/s instead. Insane 32-way unrolling with AVX2 reaches 1.4 GB/s, but that's a bit extreme and has other issues with small data sets.

    Clang prefers the assembly too, while icc runs tragically slowly on it (about half speed) but does a much better job on Lucas' branchless variant. The branchless code certainly seems to be the optimal case for pure C, but a bigger question is whether we can get compilers to generate code like the assembly above, maybe with hints about the predictability of branching?
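    One pure-C shape for the same "half branchless" split might look like this (a sketch, not the code benchmarked above): the first renorm step via the masked select, the second left as a branch annotated with __builtin_expect (GCC/Clang), since it's rare and well predicted. Whether the select actually becomes a cmov is still up to the compiler.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define RANS_BYTE_L (1u << 23)

    /* First step branchless, rare second step as a hinted branch.
     * Assumes one readable byte of slack for the speculative load. */
    static inline void RansDecRenormMixed(uint32_t *r, const uint8_t **pptr)
    {
        uint32_t x = *r;
        const uint8_t *ptr = *pptr;
        uint32_t s = x < RANS_BYTE_L;              /* 1 iff renorm needed */
        uint32_t t = (x << 8) | *ptr;              /* speculative read    */
        ptr += s;
        x = ((0u - s) & (t - x)) + x;              /* branchless select   */
        if (__builtin_expect(x < RANS_BYTE_L, 0))  /* rare second byte    */
            x = (x << 8) | *ptr++;
        *pptr = ptr;
        *r = x;
    }

    int main(void)
    {
        const uint8_t buf[] = { 0xAB, 0xCD, 0xEF };
        const uint8_t *p = buf;

        uint32_t x = 0x4000;                       /* needs both steps    */
        RansDecRenormMixed(&x, &p);
        assert(x == 0x4000ABCDu && p == buf + 2);

        x = 0x00900000u;                           /* normalized: no-op   */
        RansDecRenormMixed(&x, &p);
        assert(x == 0x00900000u && p == buf + 2);

        return 0;
    }
    ```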

  7. The Following User Says Thank You to JamesB For This Useful Post:

    RamiroCruzo (30th April 2018)

  8. #6
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    For what it's worth, the ryg_rans github repo contains a word-based renorm coder in rans_word_sse4. You can use the word-based renorm part without the SSE4.1 4-way interleaved decoder part.

    Also, I wish to emphasize that ryg_rans was written for clarity, not speed!

    I just put up what I consider a "production quality" 2x interleaved rANS implementation (straight from BitKnit) here: https://gist.github.com/rygorous/b43...796b5f96671cbe (I just copy&pasted the code from BitKnit, apologies if I missed some dependency.)

    BitKnit uses 15-bit probabilities, 16-bit (word-granularity) renormalization, and a 2x alternating-stream-interleaved setup (as described way back in https://fgiesen.wordpress.com/2015/1...s-in-practice/). I've found this to be a fairly sweet spot for rANS coding with adaptive probabilities. [If you're purely static, you should be using tANS/FSE instead.] If you have any adaptation in the loop, it's hard to get a win from more than 2 streams, in my tests anyway, and the alternating-stream trick is really nice for 2 streams in particular since it basically turns into nothing.

    It also has a few other features like raw "bypass" coding (sending a few bits pass-through) and implicitly juggles 2 alternate streams without you having to do any manual interleaving - much more convenient to use when you have multiple paths through an encoder/decoder. The idea was to make an encoder that's basically a plug-in replacement for a regular arithmetic encoder, while still getting some of the nice rANS benefits. It assumes you're flushing the rANS encoder every 64k of input data so the encoder output buffer space is nice and bounded. This one's written for adaptive probabilities (because that's what BitKnit uses) so there's no attempt to speed up the divisions or anything like that.
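    To make the alternating-stream point concrete, here's a deliberately tiny toy (my own illustration, not BitKnit's code): a uniform 8-bit alphabet, so each push is just x*256+s and each pop is x%256 — the coding itself is trivial, and the point is only that "interleaving" two streams reduces to picking a state by (i & 1), which costs essentially nothing.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define L     (1u << 16)   /* states live in [L, 2^32), 16-bit word renorm */
    #define X_MAX (1u << 24)   /* encoder renorm bound for uniform 256-ary     */

    /* Encode n bytes with two alternating rANS states; returns one past
     * the last 16-bit word written (flushed states included). */
    static uint16_t *enc(const uint8_t *in, size_t n, uint16_t *out)
    {
        uint32_t X[2] = { L, L };
        uint16_t *p = out;
        for (size_t i = n; i-- > 0; ) {        /* rANS encodes in reverse */
            uint32_t x = X[i & 1];             /* alternate streams       */
            while (x >= X_MAX) { *p++ = (uint16_t)x; x >>= 16; }
            X[i & 1] = x * 256 + in[i];
        }
        for (int k = 0; k < 2; k++) {          /* flush both final states */
            *p++ = (uint16_t)X[k];
            *p++ = (uint16_t)(X[k] >> 16);
        }
        return p;
    }

    static void dec(const uint16_t *end, size_t n, uint8_t *out)
    {
        const uint16_t *p = end;
        uint32_t X[2];
        X[1] = ((uint32_t)p[-1] << 16) | p[-2];
        X[0] = ((uint32_t)p[-3] << 16) | p[-4];
        p -= 4;
        for (size_t i = 0; i < n; i++) {
            uint32_t x = X[i & 1];             /* same alternation        */
            out[i] = (uint8_t)(x % 256);
            x /= 256;
            while (x < L) x = (x << 16) | *--p;
            X[i & 1] = x;
        }
    }

    int main(void)
    {
        uint8_t in[1000], got[1000];
        for (size_t i = 0; i < sizeof in; i++) in[i] = (uint8_t)(i * 37);
        uint16_t buf[1100];
        uint16_t *end = enc(in, sizeof in, buf);
        dec(end, sizeof in, got);
        assert(memcmp(in, got, sizeof in) == 0);
        return 0;
    }
    ```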

  9. The Following User Says Thank You to fgiesen For This Useful Post:

    JamesB (1st May 2018)

  10. #7
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts
    I can confirm Fabian's results.

    During rz's design I had to decide between adaptive and semi-static modeling. I went fully adaptive using rANS. I did lots of experiments but ended up using two alternating states and 16-bit renormalization, too.

    Alternating states are very convenient and provide good performance. I will probably change back to explicit interleaving once the syntax and everything is frozen. Explicit interleaving can be faster, but it's quite an obstacle if you're still experimenting with your codec's syntax.

    With 16-bit renormalization you don't need a loop or anything, if you keep your probability precision narrow enough. And if you're doing pointer arithmetic instead of branches, it doesn't make a difference.

  11. The Following 2 Users Say Thank You to Christian For This Useful Post:

    Bulat Ziganshin (15th May 2018),Jarek (7th May 2018)
