
Thread: [asm] Jumpless bitcodec

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    [asm] Jumpless bitcodec and LZ matchfinder

    Huffman and other bit-granular coders are based on two operations, putbits(x,n) and getbits(n), which encode/decode n bits. These operations usually involve a jump that is taken with rather high probability, around 3-30%. So the "jumping coder" (v0) looks like:

    Code:
    uint64 bitbuf;  // Bit buffer - written to outstream when filled
    int bitcount;  // Count of lower bits already filled in current bitbuf
    uint64 *ptr;
    
    putbits (x,n):
      bitbuf |= x << bitcount;
      bitcount += n;
      if (bitcount >= 64) {
        *ptr++ = bitbuf;
        bitcount -= 64;
        bitbuf = x >> (n-bitcount);
        if (ptr >= bufend)
          perform I/O...
      }
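    For completeness, the getbits() side has the same shape (this decoder is my sketch, not from the post; it assumes a 64-bit buffer refilled with 32-bit words and n <= 25, and rbitbuf/rbitcount/inptr are illustrative names):

    Code:
    uint64 rbitbuf   = 0;  // Bits not yet consumed, starting at bit 0
    int    rbitcount = 0;  // Count of valid bits in rbitbuf
    const uint32 *inptr;   // Input stream, read in 32-bit words

    uint32 getbits (int n)
    {
      if (rbitcount < n) {                        // refill jump, similarly hard to predict
        rbitbuf  |= (uint64)*inptr++ << rbitcount;
        rbitcount += 32;
        // check inptr against the input buffer end and perform I/O here
      }
      uint32 x = rbitbuf & ((1u << n) - 1);       // extract the n lowest bits
      rbitbuf >>= n;
      rbitcount -= n;
      return x;
    }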
    In modern Intel x86/x64 CPUs, jumps are pretty expensive, while arithmetic is cheap and loads/stores that hit the L1 cache are usually free. So we can try to replace the expensive jump with free stores and cheap arithmetic (jumpless coder, v1):

    Code:
    putbits (x,n):
      bitbuf |= x << bitcount;
      bitcount += n;
      *ptr = bitbuf;
    
      t = bitcount/32;                    // 1 if a 32-bit word was completed, else 0
      ptr = (uint64*)((char*)ptr + t*4);  // advance over the completed word
      bitbuf >>= t*32;                    // keep only the not-yet-committed bits
      bitcount %= 32;
    
      if (ptr >= bufend)
        perform I/O...
    We've replaced one frequently-taken compare&jump with 6 arithmetic operations plus another compare&jump that is taken only rarely. A mispredicted jump takes 15-20 cycles, while the 6 ops plus a correctly predicted jump cost about 2 cycles. So if the (bitcount >= 64) jump is taken with probability greater than roughly 10-15%, the new code should perform better.

    On the 32-bit x86 platform we should replace the 64-bit buffer with a 32-bit one, which immediately doubles the (bitcount >= 32) jump probability, making the advantage of this optimization even more likely.


    ADD: the v1 coder suffers from the additional arithmetic operations, especially the variable shift (SHIFT.CL), which is rather slow on modern Intel CPUs. We can try to reduce their number by writing data directly to memory (v2/v3):

    Code:
    static unsigned bitcount;  // Number of bits already filled in the *ptr byte
    static char *ptr, buf[20*1000], *bufend;
    
    void jumpless_putbits_v3 (BITBUF x, unsigned n)
    {
      *(BITBUF*)ptr |= x << bitcount;
      bitcount += n;
    
      ptr += bitcount/8;   // 2 extra arithmetic operations compared to v0
      bitcount &= 7;       // 1 more extra arithmetic operation
    
      if (ptr >= bufend)
        perform I/O
        zero-fill buf
    }
    Here bitcount holds 0..7, the number of bits already filled in the *ptr byte. Once a byte is filled, ptr advances to point at the current partially-filled byte and bitcount is masked back into the 0..7 range. This needs 3 extra arithmetic operations compared to v0, i.e. about 1 CPU cycle.
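    The decoder counterpart is just as short (again my sketch, not from the post; rptr/rbitcount are illustrative names, BITBUF is assumed to be a 32-bit type and n <= 25 as in the encoder):

    Code:
    static unsigned rbitcount;  // 0..7 - bits of the *rptr byte already consumed
    static char *rptr;          // current partially-consumed byte

    BITBUF jumpless_getbits_v3 (unsigned n)
    {
      BITBUF x = (*(BITBUF*)rptr >> rbitcount) & (((BITBUF)1 << n) - 1);
      rbitcount += n;
      rptr      += rbitcount/8;  // the same 3 extra arithmetic operations as in the encoder
      rbitcount &= 7;
      return x;
    }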
    Last edited by Bulat Ziganshin; 19th November 2013 at 13:36.

  2. The Following 5 Users Say Thank You to Bulat Ziganshin For This Useful Post:

    Cyan (13th November 2013),encode (14th November 2013),Matt Mahoney (13th November 2013),Paul W. (14th November 2013),sh0dan (13th December 2013)

  3. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I've tried my idea and a few variations, but it doesn't work. The v1 (original jumpless) codec is about 1.5-1.7x slower than v0 (jumping), and the other versions (v2..v4) are no better. It seems the problem is that we need to perform one more expensive operation - either a SHIFT.CL or an unaligned store. v4 should use CMOV, but I haven't managed to force gcc 4 to use CMOV instead of a conditional jump.

    Nevertheless, it may be good for other bit distributions and when the bitcoder is joined with a match finder engine. Last note: the jumping bitcodec is much friendlier to HyperThreading - while one thread waits on a mispredicted jump, another thread can execute its code at full core speed. So if you use a jumpless bitcodec, check the full program with HT too.

    For reference, I compiled with gcc 4.8.1 using "g++ -O3 -mtune=generic -funroll-all-loops -Wa,-adhlns=jumpless.lst jumpless.cpp" and tested on Haswell.
    Attached Files

  4. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I've found the source of the problems in the v2/v3 codecs: they suffer from read-after-write conflicts on adjacent addresses. If their first lines are changed to "*(BITBUF*)ptr = x << bitcount;" - i.e. with "=" instead of "|=" - they have about the same speed as the jumping coder, give or take 5%. For a codec joined with a matchfinder this should not be a problem. So we have found a codec that has the same speed, but is still less HT-friendly.

    Also, it spends around 4 CPU cycles on Haswell per putbits() operation while performing only 7 simple arithmetic operations. It definitely suffers from operand dependencies and has more chance to improve its speed when combined with matchfinder code. OTOH, the jumping coder will just wait on its conditional jump.

    So I expect that the v3 codec will work faster than v0 in single-threaded LZH compression on Haswell.
    Last edited by Bulat Ziganshin; 14th November 2013 at 11:26.

  5. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I'm not following along too well, because this is not something I know much about. But ?: is supposed to generate conditional moves, if that helps.

  6. The Following User Says Thank You to nburns For This Useful Post:

    Bulat Ziganshin (14th November 2013)

  7. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Changed REPETITIONS to 1000*1000*100 and compiled with g++ 4.8.0 -O3 on a 2.0 GHz T3200, 3 GB, 32 bit Vista.

    jumpless 0 = 9.9 sec
    jumpless 1 = 14.4 sec
    jumpless 2 = 12.6 sec
    jumpless 3 = 15.3 sec
    jumpless 4 = 13.9 sec

    Edit: But I thought that the jumps might be predictable because they follow a repeating sequence of lengths from 2 to 16. So I replaced each putbits(12345,j); with putbits(12345,hash(i,j)); where hash(i,j) returns a pseudo-random number from 2 to 16:

    Code:
    unsigned hash(unsigned i, unsigned j) {
      unsigned t=i*314159265^j*271828182;
      t^=t>>16;
      return t%15+2;
    }
    and now the jumpless coders 2 and 3 are faster than 0.

    jumpless 0 = 17.7 sec
    jumpless 1 = 23.4 sec
    jumpless 2 = 13.6 sec
    jumpless 3 = 15.9 sec
    jumpless 4 = 23.1 sec
    Last edited by Matt Mahoney; 14th November 2013 at 19:25.

  8. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by nburns View Post
    I'm not following along too well, because this is not something I know much about. But ?: is supposed to generate conditional moves, if that helps.
    Yes, but so does an equivalent "if" statement. The two implementations of max() below generate identical code.

    Code:
    // a.c
    unsigned max1(unsigned a, unsigned b) {
      if (a>b)
        return a;
      else
        return b;
    }
    
    unsigned max2(unsigned a, unsigned b) {
      return a>b ? a : b;
    }
    
    C:\tmp>gcc -O3 -S a.c
    
    C:\tmp>type a.s
            .file   "a.c"
            .text
            .p2align 4,,15
            .globl  _max1
            .def    _max1;  .scl    2;      .type   32;     .endef
    _max1:
            movl    4(%esp), %edx
            movl    8(%esp), %eax
            cmpl    %eax, %edx
            cmovae  %edx, %eax
            ret
            .p2align 4,,15
            .globl  _max2
            .def    _max2;  .scl    2;      .type   32;     .endef
    _max2:
            movl    4(%esp), %eax
            movl    8(%esp), %edx
            cmpl    %eax, %edx
            cmovae  %edx, %eax
            ret
            .ident  "GCC: (rev1, Built by MinGW-builds project) 4.8.0"

  9. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I think the difference between ?: and if() is that ?: is supposed to become a cmov, but changing an if() to a cmov is an optimization, which is harder to control.

  10. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    There's nothing like that in the language definition. Can you show me where they produce different code?

  11. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    There's nothing like that in the language definition. Can you show me where they produce different code?
    I think I saw this on something like stackexchange. The language definition wouldn't be able to weigh in on this, I don't think, because it obviously depends on the existence of the instruction. You'd probably want to consult specific compiler docs to see how they handle this.

    Edit: Actually, it's even easier than reading the docs. Just compile with optimizations off. That should tell you what code naturally "means," before the optimizer stirs it up.
    Last edited by nburns; 14th November 2013 at 21:08.

  12. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Compiling without optimization is just the easiest way to produce code. There is no official definition of the asm code produced from C++, neither unoptimized nor optimized. Moreover, it's simpler to optimize an "if" without an else into CMOV than an if-else or the equivalent ?: statement.

    I thought that the jumps might be predictable
    I thought about it too, but didn't realize that I could add a hashing function, since it needs the same time in any codec anyway. So it seems that the jumpless codec may be faster, and it had better be tested in real apps.

    For LZ4, a bitcodec may improve compression by 10-20% while decreasing speed by the same 10-20%. The 32-bit v3 codec may encode up to 25 bits in a single putbits() call, and one LZ4 match may be encoded with 9-25 bits in the following way:

    1 bit (literal "1") + 4 bits (match length - 4) + 4 bits (match offset length) + 0..16 bits (match offset remaining bits)

    So, having the match length and match offset, we first produce a 25-bit number containing all the data and then make just one putbits() call.

    Literals may be encoded as: 1 bit (literal "0") + 4 bits (literal run length) + 1..16 bytes (the literals).
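    A minimal sketch of that literal path on top of the v3 coder (my code, not from the post; it assumes the run length is stored as run_len-1 so that 1..16 fits into 4 bits, and it pushes the literal bytes through the same bitstream):

    Code:
    void encode_literal_run (const unsigned char *lit, unsigned run_len)  // run_len = 1..16
    {
      putbits_v3 ((run_len-1) << 1, 1+4);  // flag bit "0" in the lowest position, then the 4-bit run length
      for (unsigned i = 0; i < run_len; i++)
        putbits_v3 (lit[i], 8);            // literals aren't byte-aligned here; the decoder must read them the same way
    }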

    Retrieving the next match data may be overlapped with match/literal encoding in order to reduce cache delays. Well, it should help in plain LZ4 too.
    Last edited by Bulat Ziganshin; 14th November 2013 at 22:21.

  13. #11
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Compiling without optimization is just the easiest way to produce code. There is no official definition of the asm code produced from C++, neither unoptimized nor optimized.
    There can't be one. There is no standard assembly language to define it for. C++ is architecture-agnostic.

    However, I think the code that results from compiling with optimizations off tells you a lot, and I believe I tried this and verified that ?: generates cmov (not convenient to try now), at least in gcc. What this tells you is that the compiler tracks the fact that it was a ?: and not an if in its internal representation. If ?: was exactly equivalent to if, you would never see differences in the code they generate.

    And it simply isn't true in general that optimizations and no optimizations are just the same thing with different levels of complexity. Optimization is a completely separate process in compilers and compiler theory. It's not exact and can always screw things up. That's why you can turn it off.

    Moreover, it's simpler to optimize an "if" without an else into CMOV than an if-else or the equivalent ?: statement.
    The ?: doesn't have to be optimized to cmov, because it already starts out as cmov. Whether you could force ?: to become a jump is a more interesting question...
    Last edited by nburns; 14th November 2013 at 21:51.

  14. #12
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    You said that loads and stores to L3 cache are usually free... did you mean L1? Or are you assuming the block is usually cached in L1 as well, and saying it works out fine that way too?

  15. The Following User Says Thank You to Paul W. For This Useful Post:

    Bulat Ziganshin (14th November 2013)

  16. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Thank you, of course I meant L1.

  17. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    gcc without optimization produces different code for max1() and max2(). It uses CMOV for max2() only. With -O1, -O2, or -O3 it uses CMOV for both functions.

    VC++ (v16) with /arch:SSE2 /O1 uses CMOV for both functions. For /O2 or no optimization, or without /arch:SSE2, it uses jumps for both functions.

  18. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    nburns (15th November 2013)

  19. #15
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Hello everyone,

    I'd like to mention that instead of getting rid of jumps, if you know quite precisely the probability of those jumps - which is the case here - and are using GCC (no ASM), you might consider the following two macros, as used in the Linux kernel:

    Code:
    #define likely(x)   __builtin_expect((x),1)
    #define unlikely(x) __builtin_expect((x),0)
    If you know your jump has a low probability (10%?), you could write if (unlikely(your test)) { ... }.
    By doing this you inform your processor that "your test" is unlikely, and fewer clock cycles are spent if you're right. On the other hand, if you're wrong, it costs more, so you'd better be sure.
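    For example, applied to the flush test of the v0 coder from the first post (just an illustration of the macro; the probabilities discussed there still decide whether the hint helps):

    Code:
    putbits (x,n):
      bitbuf |= x << bitcount;
      bitcount += n;
      if (unlikely(bitcount >= 64)) {   // hint to GCC that the flush path is the cold one
        *ptr++ = bitbuf;
        bitcount -= 64;
        bitbuf = x >> (n-bitcount);
        if (unlikely(ptr >= bufend))
          perform I/O...
      }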

  20. #16
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    VC++ (v16) with /arch:SSE2 /O1 uses CMOV for both functions. For /O2 or no optimization, or without /arch:SSE2, it uses jumps for both functions.
    Maybe VC++ uses a jump for ?: by default because some expressions would need to be jumps to avoid double side effects from computing both branches, so it's considered safer to do them all the same way. It could also signify that VC++ has no separate code for ?: in their intermediate format, unlike gcc. You could probably make arguments for both ways of handling it. Apparently, either a jump or a cmov can be faster, depending on how it all breaks down.

  21. #17
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    gpnuma, I know and use __builtin_expect, but it's a different optimization. __builtin_expect doesn't tell the CPU anything; at best we can expect that the compiler will reorganize the jump so it's not taken in the expected case - this can improve instruction fetching, but probably not by much. When the CPU executes the code, it knows how the jump really behaves, and for a jump taken with 10-20% probability I expect that it will speculatively execute the most probable branch, so there will be no delays there, while the less probable branch will be executed with a 10-20 CPU cycle delay required to load its instructions. It's all described in the Agner Fog books we've mentioned on the forum.

  22. #18
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    gpnuma, I know and use __builtin_expect, but it's a different optimization. __builtin_expect doesn't tell the CPU anything; at best we can expect that the compiler will reorganize the jump so it's not taken in the expected case - this can improve instruction fetching, but probably not by much. When the CPU executes the code, it knows how the jump really behaves, and for a jump taken with 10-20% probability I expect that it will speculatively execute the most probable branch, so there will be no delays there, while the less probable branch will be executed with a 10-20 CPU cycle delay required to load its instructions. It's all described in the Agner Fog books we've mentioned on the forum.
    I was saying that for the sake of simplicity ... What it actually does is tell the compiler to generate instructions that favor the likeliness of your jump (it informs your processor only by means of the generated instructions ... the wording wasn't adequate ...?), as you say.

    If the jump goes along with the prediction, it costs fewer clock cycles and can sometimes even be free. Otherwise the processor pipeline will need flushing, and that will cost more clock cycles.

    But again I agree, modern processors are extremely good at branch prediction, so it might not add much value, but it's still good to keep in the corner of your mind!

  23. #19
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    We can go even further, extending the v3 codec with the following simple function:
    Code:
    // 3 arithmetic operations
    void simple_putbits_v3 (BITBUF x, unsigned n)
    {
      *(BITBUF*)ptr |= x << bitcount;
      bitcount += n;
    }
    It skips the overflow check and may be used when we are sure that overflow can't occur. The v3 codec keeps bitcount <= 7, so with a 32-bit BITBUF there are at least 25 bits that can be written without overflow, and with a 64-bit BITBUF at least 57 available bits. For example, the above-mentioned writing of 25 bits encoding an LZ4 match may be implemented as follows:
    Code:
    simple_putbits_v3 (0, 1);  // == bitcount++ 
    simple_putbits_v3 (match_len, 4);
    simple_putbits_v3 (match_offset_bits, 4);
    putbits_v3 (match_offset, match_offset_bits);
    Here, only the last call performs the bitcount check. But going further, we can combine all the fields in a register and then write them to memory in a single operation:
    Code:
    bitcount++;  // simple_putbits_v3 (0, 1)
    BITBUF saved_data = match_len | (match_offset_bits<<4) | (match_offset<<8);
    putbits_v3 (saved_data, 4+4+match_offset_bits);
    The whole fragment needs 11 simple ops plus one SHIFT.CL, and may easily be combined with the preparation of the next match search (LOAD+MUL+SHIFT.const+LOAD+LOAD), making the entire match encoding operation essentially free.
    Last edited by Bulat Ziganshin; 15th November 2013 at 20:05.

  24. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    A bit more on LZ4+bitcodec: once a match has been found, we can start handling the next one:
    Code:
    uint32 data = *(uint32*)(curptr+curmatch_len); // Load from L1 cache, 4 CPU cycles
    uint32 hash = (data*42)>>20; // mul+shift.const, 3+1 CPU cycles
    uint32 *match_ptr = hashtable[hash]; // Load from L1 cache, 4 CPU cycles
    uint32 match_data = *match_ptr; // Load from L1/L2 cache, 4-11 CPU cycles
    // 16-23 CPU cycles, 3 load operations total
    It needs only 5 operations, including 2 arithmetic ones, but since each operation depends on the previous one, their delays add up to a total of 16-23 cycles. Core2..Haswell CPUs can execute ~50-90 arithmetic operations in that time, while we are consuming only 2. This makes it perfect for HyperThreading or for overlapping this code with the bitcoder and even a Huffman coder: as we calculated in the previous post, the entire bitcoder implementation requires about 20 arithmetic ops.

    OTOH, we can exploit these cycles by checking more than one possible match or by implementing a lazy matchfinder. It seems these should be almost free, since they need only a few more arithmetic ops and the delays overlap:
    Code:
    uint32 data = *(uint32*)(curptr+curmatch_len);
    uint32 hash = (data*42)>>20;
    uint32 *match_ptr0 = hashtable[hash*2];
    uint32 *match_ptr1 = hashtable[hash*2+1];
    uint32 match_data0 = *match_ptr0;
    uint32 match_data1 = *match_ptr1;
    if (data==match_data0) ...
    else if (data==match_data1) ...   // 1 extra ariop if previous comparison failed
    // 5 load operations total
    This code checks 2 possible matches and spends at most 1 extra arithmetic op.

    lazy matching is a bit more expensive:
    Code:
    uint32 data0 = *(uint32*)(curptr+curmatch_len);
    uint32 data1 = *(uint32*)(curptr+curmatch_len+1);   // lazy candidate at the next position
    uint32 hash0 = (data0*42)>>20;
    uint32 hash1 = (data1*42)>>20;   // 2 extra ariops
    uint32 *match_ptr0 = hashtable[hash0];
    uint32 *match_ptr1 = hashtable[hash1];
    uint32 match_data0 = *match_ptr0;
    uint32 match_data1 = *match_ptr1;
    if (data0==match_data0) ...
    else if (data1==match_data1) ...   // 1 extra ariop if previous comparison failed
    // 2*3 load operations total
    So it needs 2-3 extra arithmetic ops.


    We can consider even 4-entry hashtables (7 load operations), or lazy matching with 2/4-entry hashtables (2*5 or 2*7 loads). Taking into account that Core2 can perform 1 load per cycle, Sandy Bridge and later 2 loads per cycle, and that the bitcoder doesn't use many memory operations, even lazy matching with 4-entry tables may become almost free, i.e. become part of this 16-23 cycle delay that we cannot avoid.
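    A sketch of the 4-entry probe, following the 2-entry pattern above (my code, not from the post; all four hashtable loads are independent, so their latencies overlap):

    Code:
    uint32 data = *(uint32*)(curptr+curmatch_len);
    uint32 hash = (data*42)>>20;
    uint32 *p0 = hashtable[hash*4+0], *p1 = hashtable[hash*4+1],  // 4 candidate positions
           *p2 = hashtable[hash*4+2], *p3 = hashtable[hash*4+3];  // per hash bucket
    uint32 m0 = *p0, m1 = *p1, m2 = *p2, m3 = *p3;
    if      (data==m0) ...
    else if (data==m1) ...
    else if (data==m2) ...
    else if (data==m3) ...   // at most 3 extra ariops if the earlier comparisons failed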
    Last edited by Bulat Ziganshin; 15th November 2013 at 22:12.

  25. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Cyan (15th November 2013)

  26. #21
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I'm not sure I understand the context, or whether you're interested in relying
    on vector instructions, but it seems to me that you should be able
    to compute all the cumulative lengths in log time, as a parallel
    prefix sum, so summing 8 one-byte lengths in a 128-bit SSE register
    would take a handful of instructions going up and another handful
    going down.

    That would give you shifts you could use to shift the actual codes into place,
    and I'd think you could get some parallelism there, too. Even if not, the
    whole thing would be branchless.
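    If it helps, here is one branchless way to get those cumulative lengths for 8 16-bit code lengths with SSE2 intrinsics (my sketch, a shift-and-add formulation rather than the up/down sweep described above):

    Code:
    #include <emmintrin.h>

    // Inclusive prefix sum: element i becomes len[0]+...+len[i], i.e. the bit offset
    // at which code i+1 would start; element 7 is the total bit count of the group
    static inline __m128i prefix_sum_u16 (__m128i len)
    {
      len = _mm_add_epi16 (len, _mm_slli_si128 (len, 2));  // each element adds the element 1 position below
      len = _mm_add_epi16 (len, _mm_slli_si128 (len, 4));  // ... then the partial sum 2 positions below
      len = _mm_add_epi16 (len, _mm_slli_si128 (len, 8));  // ... then the partial sum 4 positions below
      return len;
    }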

  27. #22
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Hmmm... I may not be thinking clearly, but it seems like you could do the shifting and OR'ing
    in parallel, too, in another pass up the parallel prefix tree. (Not up and down.) Just shift each
    thing by the size of its right neighbor at each level, and XOR them. You'll shift a lot of things
    multiple times as you go up the tree instead of all at once, but you won't care because you've
    got to do an instruction or two at each level anyhow. Since you have half as many things
    to deal with at each step up, you can use half as many subregisters twice as long at each
    level, and there's no chance of overflow.

  28. #23
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Paul, moving between general-purpose and SSE registers isn't free, so SSE is useless here (if you mean calculation of saved_data variable)

  29. #24
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I was thinking more about situations where you can't fold together what are really multiple codes, combine the lengths at compile time, and do just one putbits(), as in your example, and where you don't have a long-latency instruction to hide the putbits() behind.

    For example, suppose I'm Huffman coding an array of one-byte literals. Each one of those will require a putbits(), but I can Huffman code eight literals at a time using parallel prefix on SSE2, or 16 with AVX. Outputting a single symbol (or 8 or 16) takes longer my way, but I can do 8 or 16 at a time, except for reading the initial 8 or 16 bytes and writing the resulting shifted bits.

    In this scheme, I don't get any fancy superscalarness with instruction scheduling---I just get the SIMD parallelism I write, and it all happens in the vector unit, not using the general-purpose ALUs at all.

    That seems like it could potentially let me use the general-purpose ALUs for something else at the same time, such as mapping the next 16 byte values to their Huffman coding info. (Using successive byte values as indexes into an array, and fetching the corresponding codes & lengths.)

    On the one hand, it seems like you can usually hide a few instructions per symbol in something else you're doing, but on the other it also seems like I should be able to use the vector unit for bit coding while I'm doing something else, too.

    (But then, I have no experience with this sort of parallelization, and may be missing something, or several things.)
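    A sketch of the per-byte lookup mentioned above (my code; the table layout is hypothetical):

    Code:
    #include <stdint.h>

    struct HuffEntry { uint32_t code; uint8_t len; };  // hypothetical per-symbol table entry
    static HuffEntry hufftab[256];

    // Gather codes and lengths for the next 16 literals on the scalar side,
    // while the vector unit packs the previous group of codes
    static void fetch_codes (const uint8_t *src, uint32_t codes[16], uint8_t lens[16])
    {
      for (int i = 0; i < 16; i++) {
        codes[i] = hufftab[src[i]].code;
        lens[i]  = hufftab[src[i]].len;
      }
    }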

  30. #25
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    The v2/v3 coders suffer from store-forwarding stalls, which cost 10-12 cycles on Core2..Haswell CPUs. This means it's preferable to call the full putbits() only once per 10-12 CPU cycles, using simple_putbits() in between; or we can make several successive putbits() calls if their overall stall time is amortized by further (match finding) code. In my tests, the putbits_v3() coder spent ~13 cycles on average per operation (with j=2..16), but only ~4 cycles when I suppressed the stall (by replacing "|=" with "="). v2 may produce fewer stalls but has its own small disadvantages.
    Last edited by Bulat Ziganshin; 15th November 2013 at 22:06.

  31. #26
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Paul, maybe you can write some (pseudo)code? Moving from a GPR to an XMM register is 2 mops and moving the data back is 2 more mops, so if you perform just one arithmetic operation in the XMM register, it doesn't pay for itself. And even if you know SSE/AVX, it's hard to do anything useful in these registers.

    And you definitely should study http://www.agner.org/optimize/microarchitecture.pdf and http://www.agner.org/optimize/instruction_tables.pdf - in particular, both scalar and vector instructions use the same ALUs.
    Last edited by Bulat Ziganshin; 16th November 2013 at 12:45.

  32. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Paul W. (16th November 2013)

  33. #27
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I googled up some other folks doing prefix sums with SSE and/or AVX, and found these over at Stack Overflow:

    http://stackoverflow.com/questions/1...e-sum-with-sse
    http://stackoverflow.com/questions/1...m-on-intel-cpu

    In both cases the good stuff is toward the end.

    In particular, have a look at the "tested and working code" toward the end of the first one for floats with SSE, which is supposed to give a 2x speedup over the non-SIMD version. I'd think you'd do better for 8 16-bit ints, and maybe still better for 16 8-bit bytes if you special-case the first layer. (The first layer would be slower, but it would be worth it.)

    I'm not at all sure though, partly because I don't yet understand the latency issues for the vector code, and whether you could interleave other activities with it usefully.

    Either way, thanks very much for your comments. I have started reading Agner Fog and it will be very useful.
    Last edited by Paul W.; 16th November 2013 at 11:44.

  34. #28
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Paul, they read data from memory in 16/32-byte words. If you move every separate part to and from a GPR, you will spend 4 mops on PINSR+PEXTR to save a single GPR ADD.

    Overall, the C++ view, the asm view and the microarchitecture view are all very different beasts, so things that look reasonable at the higher levels in many cases don't work out on real hardware.
    Last edited by Bulat Ziganshin; 16th November 2013 at 12:45.

  35. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Paul W. (16th November 2013)

  36. #29
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Compression of a 1-billion-byte executable (exe9); all numbers are in thousands:
    -1: 590mb. Literals 325966, matches 72453 = 54602 + 17258 + 593 (with distances up to 4K/256K/other)
    -3: 413mb. Literals 222374, matches 80076, rep0s 22869
    -5: 330mb. Literals 143523, matches 80508, rep0s 9264

    Compression of enwik9:
    -1: 531mb. Literals 198148, matches 106546 = 63145 + 42943 + 457
    -3: 350mb. Literals 65069, matches 121569, rep0s 2694
    -5: 300mb. Literals 41727, matches 99981, rep0s 1387


    These are stats for tornado, whose -1 mode is very similar to LZ4, only with unlimited match distances (which doesn't buy us too much). The most interesting part is that we have only 7-10 matches per 100 input bytes, and as I wrote earlier, the code after a successful match is the only real problem - everything else can be pipelined.

    With about 20 CPU cycles per successful match, LZ4 speed on my 4770 CPU should be around 2 GB/s, while now it's only 500 MB/s. Where is the catch? I think there are two problems: register shortage, especially in 32-bit x86, and unpredicted matchfinder jumps. Let's talk about those jumps.

    Once we have loaded match_data into a register, we can start matching:
    Code:
    if (match_data!=data) goto try_next_pos;  // on exe9, it jumps in 325/(72+325)=80% cases
    if (*(uint32*)(match_ptr+4) != *(uint32*)(ptr+4)) goto 4_7_bytes_matched;
    if (*(uint32*)(match_ptr+8) != *(uint32*)(ptr+8)) goto 8_11_bytes_matched;
    ...
    The first unpredicted jump for a successful match is falling through the "goto try_next_pos", which happens in only 20% of cases. The second unpredicted jump is the jump to one of the x_y_bytes_matched labels. We can avoid further unpredicted jumps by using BSF(dataN^match_dataN)/8 to compute the number of matched bytes. The second jump could also be avoided if we had some wide 32+ byte registers with a BSF operation; maybe there is a way to do it using SSE/AVX or by combining BSF results from several GPR registers, but I don't have an idea right now. We may also recall the SSE 4.2 string operations and their wider AVX2 equivalents.
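    In C, the BSF trick for one word pair looks like this (a sketch, assuming little-endian byte order and using GCC's __builtin_ctz as the BSF equivalent):

    Code:
    // Number of equal low-order bytes in two 32-bit words (0..4)
    unsigned matched_bytes (uint32 a, uint32 b)
    {
      uint32 x = a ^ b;                    // zero iff all 4 bytes are equal
      return x ? __builtin_ctz(x)/8 : 4;   // BSF finds the first differing bit
    }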

    Due to register shortage, it's better to keep in registers only the XORed values:
    Code:
    // First, preload data from L1 cache
    xored1 = *(uint32*)(match_ptr+0) ^ *(uint32*)(ptr+0)
    xored2 = *(uint32*)(match_ptr+4) ^ *(uint32*)(ptr+4)
    xored3 = *(uint32*)(match_ptr+8) ^ *(uint32*)(ptr+8)
    
    // Now start matching
    if (xored1) goto try_next_pos;
    xored1 = *(uint32*)(match_ptr+12) ^ *(uint32*)(ptr+12)  // preload next wordpair
    if (xored2) goto 4_7_bytes_matched;
    xored2 = *(uint32*)(match_ptr+16) ^ *(uint32*)(ptr+16)
    if (xored3) goto 8_11_bytes_matched;
    ..
    
    4_7_bytes_matched:
    match_len = 4+BSF(xored2)/8 
    goto match_found;
    
    8_11_bytes_matched:
    match_len = 8+BSF(xored3)/8 
    goto match_found;
    This code should spend 16-23 cycles in the match preparation code, 2*(15..20) cycles in the two unpredicted jumps, plus a few more cycles searching for the first non-zero xoredN (since the average match length is 8-10). The first jump seems unavoidable, since we need two different code branches for matches and non-matches, while the second one can probably be avoided for matches of reasonable length (up to 16..32), either by using SSE 4.2 or by combining BSF, CMOV and some arithmetic tricks.

  37. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Cyan (16th November 2013)

  38. #30
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Now, jumpless match length computation:
    Code:
    MOV r0, [data_ptr+4]
    XOR r0, [match_ptr+4]
    BSF r1, r0   ; r1 = index of the lowest set bit
    SHR r1, 3    ; /8 = number of extra matched bytes
    
    MOV r0, [data_ptr+8]
    XOR r0, [match_ptr+8]
    BSF r2, r0
    SHR r2, 3
    ... Now r1, r2... store the number of matched bytes at match_ptr+4, match_ptr+8... We can even use SSE/AVX to do the XOR and then extract the individual words from the SSE register:
    Code:
    MOVDQU xmm0, [data_ptr]
    PXOR xmm0, [match_ptr]
    MOVD r0, xmm0
    TEST r0,r0
    JNZ match_not_found
    
    PEXTRD r0, xmm0, 1   ; SSE4.1
    BSF r1, r0
    SHR r1, 3
    ...
    The SSE variant needs more arithmetic mops, but fewer registers/L1 cache reads. Using CMOV, we can combine the individual BSF results:
    Code:
    ; The code is still incorrect, i will fix it later
    MOV r0, [data_ptr+20]
    XOR r0, [match_ptr+20]
    BSF r1, r0
    JZ probably_24_or_more_bytes_are_equal
    ADD r1,20
    MOV r3,4*8  ; r3/8 = match length
    
    MOV r0, [data_ptr+16]
    XOR r0, [match_ptr+16]
    BSF r2, r0
    CMOVZ r3,r1
    ADD r2,16
    
    MOV r0, [data_ptr+12]
    XOR r0, [match_ptr+12]
    BSF r1, r0
    CMOVZ r3,r2
    ADD r1,12
    ...
    SHR r3,3  ; now r3=match length


    The SSE 4.2 PCMPESTRI instruction can also be used to find the number of equal bytes:
    Code:
    ; PCMPESTRI expects the string lengths in EAX and EDX
    MovDqU xmm0, [data_ptr]
    PcmpEstrI xmm0, [match_ptr], EQUAL_EACH + NEGATIVE_POLARITY  ; find the first *different* byte, hence negative polarity
    ja all_16_bytes_are_equal
    ; here ECX holds the number of equal bytes
    According to Agner, PCMPESTRI is rather slow. We can replace it with the faster PCMPISTRI and lose a few matches containing zero bytes. Or we can employ the SSE2 PCMPEQB instruction:
    Code:
    MovDqU xmm0, [data_ptr]
    PCMPEQB xmm0, [match_ptr]
    ; here the equal bytes of xmm0 contain FF and the non-equal ones contain 0
    The question is how to combine all those bits now in order to get something useful. For example, it would be great to combine the results of the two 16-byte PCMPEQB comparisons into a single 32-bit GPR register containing just 1 bit per source byte and then use BSF to find the first non-zero bit:

    Code:
    ; PMADDUBSW/PHADDW require SSSE3, i.e. Core2/Bulldozer
    Mask DB 1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128
    MovDqU xmm0, [data_ptr]
    MovDqU xmm1, [data_ptr+16]
    PCMPEQB xmm0, [match_ptr]
    PCMPEQB xmm1, [match_ptr+16]
    PABSB xmm0, xmm0     ; now contains 1's for equal bytes, 0's for non-equal ones
    PABSB xmm1, xmm1
    PMADDUBSW xmm0, [Mask] ; words contain 0/1+0/2, 0/4+0/8... - non-0 bits for equal bytes
    PMADDUBSW xmm1, [Mask]
    PHADDW xmm0, xmm1    ; 8 words contain 0/1+0/2+0/4+0/8, ... - non-0 bits for equal bytes
    PHADDW xmm0, xmm0    ; 4 words contain 0/1+0/2+...+0/128 each, one per 8 source bytes
    PACKUSWB xmm0, xmm0  ; condense words into bytes
    MOVD r1, xmm0        ; 32 bits, 1 bit per source byte
    NOT r1
    BSF r1, r1
    JZ all_32_bytes_are_equal
    ; now r1 = number of leading equal bytes (index of the first non-equal byte)
    According to my calculations, this code executes in 11 cycles due to the many dependencies, but uses only 12 arithmetic ops, so it can be overlapped with something useful, like Huffman encoding of the previous match. It also lowers register pressure, so you can easily check 64 bytes, or start checking the next match position for a lazy/multi-entry match finder, or just in case the current match fails.
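    For comparison, plain SSE2 PMOVMSKB already yields exactly that 1-bit-per-byte mask; here is an alternative sketch written with intrinsics (my code, not the PMADDUBSW scheme from this post):

    Code:
    #include <emmintrin.h>  // SSE2 intrinsics
    #include <stdint.h>

    // Length of the common prefix of two 32-byte blocks (little-endian x86 assumed)
    static inline unsigned match_len32 (const char *data_ptr, const char *match_ptr)
    {
      __m128i d0 = _mm_loadu_si128 ((const __m128i*) data_ptr),
              d1 = _mm_loadu_si128 ((const __m128i*)(data_ptr+16)),
              m0 = _mm_loadu_si128 ((const __m128i*) match_ptr),
              m1 = _mm_loadu_si128 ((const __m128i*)(match_ptr+16));
      unsigned eq0 = _mm_movemask_epi8 (_mm_cmpeq_epi8 (d0, m0));  // 16 bits, 1 = equal byte
      unsigned eq1 = _mm_movemask_epi8 (_mm_cmpeq_epi8 (d1, m1));
      uint32_t neq = ~(eq0 | (eq1 << 16));                         // set bits mark differing bytes
      return neq ? __builtin_ctz (neq) : 32;                       // BSF over the combined mask
    }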
    Last edited by Bulat Ziganshin; 17th November 2013 at 14:52.
