
Thread: Bitslicing and compression

  1. #1
    Member
    Join Date
    Jul 2013
    Location
    Stanford, California
    Posts
    24
    Thanks
    7
    Thanked 2 Times in 2 Posts

    Bitslicing and compression

    I'm wondering, for which compression algorithms could a bitslicing approach be used to achieve greater performance (when a lot of data is to be compressed in parallel)? Are any of the compression algorithms deterministic enough to do this? Can they be modeled as a series of bitwise logic operations? What about a core subset of those algorithms?

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Talking

    You can easily slice a file into 8 files bijectively, compress each one separately, and then combine them bijectively when done. However, for most common files the compression is likely to be worse. That said, I am sure that for some types of files this could be quite good.
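    For what it's worth, here is a rough sketch of that kind of slicing (my own illustration, not code from the post): each input byte contributes one bit to each of 8 planes, and recombining just reverses the loops. The bijective framing of the 8 outputs is left out.

    Code:
    #include <stddef.h>

    /* Slice n input bytes into 8 bit planes: bit k of byte i becomes bit (i % 8)
       of out[k][i / 8]. Each out[k] must hold at least (n + 7) / 8 zeroed bytes. */
    void bit_slice(const unsigned char *in, size_t n, unsigned char *out[8])
    {
        for (size_t i = 0; i < n; i++)
            for (int k = 0; k < 8; k++)
                out[k][i >> 3] |= ((in[i] >> k) & 1u) << (i & 7);
    }

    /* Inverse: rebuild the original n bytes from the 8 planes. */
    void bit_unslice(unsigned char *out, size_t n, unsigned char *in[8])
    {
        for (size_t i = 0; i < n; i++) {
            unsigned char b = 0;
            for (int k = 0; k < 8; k++)
                b |= (unsigned char)(((in[k][i >> 3] >> (i & 7)) & 1u) << k);
            out[i] = b;
        }
    }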

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    The general rule when designing data structures is to store together things that are used together. The lower the level, the more prevalent that rule is. Therefore it would be highly unusual for a file format to store unrelated bits in the same byte throughout the entire file.

    A much more sensible and widely used approach is just splitting the file/stream into chunks and processing the chunks separately.
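    A minimal sketch of that chunked approach (my example, using zlib's compress()/compressBound() and an OpenMP loop; error handling and the container format that records the chunk sizes are omitted):

    Code:
    #include <stdlib.h>
    #include <zlib.h>               /* link with -lz, compile with -fopenmp */

    #define CHUNK_SIZE (1u << 20)   /* 1 MiB chunks */

    /* Compress each chunk independently; out[i] and out_len[i] receive chunk i. */
    void compress_chunks(const unsigned char *in, size_t n,
                         unsigned char **out, uLongf *out_len)
    {
        size_t nchunks = (n + CHUNK_SIZE - 1) / CHUNK_SIZE;
        #pragma omp parallel for
        for (size_t i = 0; i < nchunks; i++) {
            size_t off = i * CHUNK_SIZE;
            uLong src_len = (n - off < CHUNK_SIZE) ? (uLong)(n - off) : CHUNK_SIZE;
            out_len[i] = compressBound(src_len);     /* worst-case output size */
            out[i] = malloc(out_len[i]);
            compress(out[i], &out_len[i], in + off, src_len);
        }
    }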

  4. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I'm not sure what the exact definition of bitslicing is, but it looks like it could be interpreted broadly as making use of SIMD. ANS can be done with SIMD. Naturally, each operation can't depend on the previous one -- so the model has to be stationary across each concurrent timestep. Then, you have to be able to aggregate the concurrent outputs into a stream. ANS produces output in discrete "packets" that you can interleave. People have done this -- see the ANS threads. I'm not sure how many examples of SIMD there are in compression. As Piotr said, you can make compression embarrassingly parallel by blocking the input, but that has implications that you may not want. The more interesting question is how you can parallelize and still get exactly what you want.
    Last edited by nburns; 12th April 2014 at 05:20.

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    if you mean jumpless operation that opens the door for SIMD, then it's a broad range of entropy coders (bitcoder, huffman, fse, arith) and at least one LZ-style thing - libssc. afaik some context modeling in paq can also employ SIMD. also, there is a CUDA ST8 sorter, which may be implemented with SIMD using scatter/gather commands

  6. #6
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    if you mean jumpless operation that opens the door for SIMD, then it's a broad range of entropy coders (bitcoder, huffman, fse, arith) and at least one LZ-style thing - libssc. afaik some context modeling in paq can also employ SIMD. also, there is a CUDA ST8 sorter, which may be implemented with SIMD using scatter/gather commands
    I think it has more to do with dependencies between instructions. If one instruction depends on the result of a previous instruction, they have to be sequenced. For example, the arithmetic version of ANS is based on Horner's rule for computing a polynomial, so it's like this:

    Code:
    for(i=0; i<n; i++) {
       y = (y * x) + c[i]; 
    }
    Each iteration needs the value of y from the previous iteration. I don't think you can parallelize that. But if you rewrote it to compute the terms separately and add, there are opportunities for SIMD there, because the terms don't depend on each other.
    Last edited by nburns; 12th April 2014 at 14:25.

  7. #7
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I was guessing that maybe bit-slicing meant doing radically SIMD-style computation using bitwise ops on registers full of bits in parallel.

    For example, with just 64-bit registers and normal bitwise ops---no SIMD instructions---you can implement 64 parallel 3-bit counters "vertically" by using one register to hold the ones bits for all 64 counters, another for the corresponding twos bits, and another for the fours, plus a couple more for the carries out of the ones and twos. (Then you just AND the ones registers of the inputs to get the corresponding carries out, and XOR them to get the output ones, etc., and repeat up the place values, so it's only a few instructions per place value.)

    That's a standard trick in writing fast cellular automata, and I assume for a variety of other things where you need to do trivially parallelizable stuff on many little bitty pieces of data.
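    In C, a sketch of those vertical counters might look like this (my illustration of the trick, not code from anywhere in particular): each 64-bit word holds one place value for all 64 counters, and adding two sets of counters is just a ripple-carry built from AND/XOR/OR.

    Code:
    #include <stdint.h>

    /* 64 three-bit counters stored "vertically": bit i of .ones/.twos/.fours is
       the 1s/2s/4s place of counter i. */
    typedef struct { uint64_t ones, twos, fours; } vcnt3;

    /* Add two sets of 64 counters at once; overflow out of the 4s place is dropped. */
    static vcnt3 vcnt3_add(vcnt3 a, vcnt3 b)
    {
        vcnt3 r;
        uint64_t c1 = a.ones & b.ones;                               /* carries out of the 1s */
        r.ones      = a.ones ^ b.ones;
        uint64_t c2 = (a.twos & b.twos) | (c1 & (a.twos ^ b.twos)); /* carries out of the 2s */
        r.twos      = a.twos ^ b.twos ^ c1;
        r.fours     = a.fours ^ b.fours ^ c2;
        return r;
    }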

  8. #8
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    The Wikipedia page on "Bit Slicing" starts off talking about bit-slice hardware construction, which is related but different, but then says:

    In more recent times, the term bitslicing was re-coined by Matthew Kwan [1] to refer to the technique of using a general purpose CPU to implement multiple parallel simple virtual machines using general logic instructions to perform Single Instruction Multiple Data operations. This technique is also known as SWAR, SIMD Within A Register.

    This was initially in reference to Eli Biham's 1997 paper A Fast New DES Implementation in Software,[2] which achieved significant gains in performance of DES by using this method.

  9. #9
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I suspect that "bit slicing" here usually means something more like "bitfield slicing" (you don't have to get down to single bits), and SWAR is a better term but unfortunately an obscure acronym.

    Despite my previous bitwise-parallel example, it's usually more useful to work with multi-bit bitfields, e.g., using a 64-bit register as a parallel array of 16 4-bit registers. When you do that, you can do some limited non-bitwise ops like adds and small shifts, and preserve the abstraction of little registers, as long as nothing overflows or underflows into its neighboring, entirely conceptual "sub-registers."

    The bitfields don't have to be powers of two---e.g., you can pack 3 9-bit fields into a 32-bit register---and they don't have to be of uniform size. (E.g., you can add parallel arrays of various weird-sized little bitfields with ordinary add instructions on registers, so long as you need to do the same operations on all of them.) You just have to make sure none of your conceptual "fields" overflows or underflows into an adjacent conceptual "field".

    This can lead to some fast but seriously unreadable code.
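    For instance, a small SWAR sketch (mine, purely for illustration): adding 16 packed 4-bit fields with one 64-bit add, while keeping carries from spilling into the neighboring field.

    Code:
    #include <stdint.h>

    /* Add 16 independent 4-bit fields packed in a 64-bit word. The high bit of
       each field is masked off and folded back in with XOR, so a carry can never
       ripple into the next field; each field wraps modulo 16. */
    static uint64_t add4x16(uint64_t a, uint64_t b)
    {
        const uint64_t H = 0x8888888888888888ULL;  /* high bit of every 4-bit field */
        uint64_t low = (a & ~H) + (b & ~H);        /* add the low 3 bits of each field */
        return low ^ (a & H) ^ (b & H);            /* restore high bits, minus cross-field carry */
    }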

  10. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Paul:

    Based on wikipedia and whatever else came up in my search, I concluded that bitslicing can mean both hardware-supported SIMD and pure-software SIMD.

    It sounds to me like somebody had the idea to do SIMD "in a register" -- i.e. on a regular CPU, rather than on special purpose chips -- and the result was both MMX and the kind of software-only hacks that you're talking about.

  11. #11
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    nburns:

    It sounds to me like somebody had the idea to do SIMD "in a register" -- i.e. on a regular CPU, rather than on special purpose chips -- and the result was both MMX and the kind of software-only hacks that you're talking about.
    I'm guessing that a whole lot of people have come up with the idea, here and there, and that it goes way, way back. I would guess that people were doing SIMD within registers almost from day one, and vector processing was just adding a little hardware to make it easier.

    It's the kind of thing people are going to think of when they're doing various concrete things in software, like implementing Binary Coded Decimal, and only later throwing a little hardware at it.

    I came up with it as an undergrad trying to come up with a fast implementation of Conway's Game of Life (cellular automaton), before I'd ever heard of vector processing hardware, and later found out it was a standard technique, and still later found out there was hardware like that. It's sort of implicit in the classic version of Life in Smalltalk where you just use BitBlt (Bit Block Transfer) operations to XOR and AND together a handful of whole bitmaps. (By bitmaps, I mean literal one-bit-per-pixel bitmaps.) And really, it's probably the basic idea underlying decent software (or microcoded) implementations of various things like BitBlt and BCD in the first place.

    I'd be surprised if SIMD-Within-A-Register wasn't more common in the 1950's when memory was so tight and processors were so slow---people often had to pack data pretty tightly, and avoiding unpacking and repacking too much was worth a lot of thought.
    Last edited by Paul W.; 16th April 2014 at 17:53.

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Bit slicing was an architecture used in the 1970's where you had a 4 bit ALU and register set on a chip with carry in and carry out pins so that you could wire them together to build an 8, 16, 32, etc. bit ALU. Then you added some memory, I/O, and control logic to build your computer. Now, of course, transistors are smaller so everything is on one chip.
    Last edited by Matt Mahoney; 16th April 2014 at 18:25.

  13. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by nburns View Post
    Code:
    for(i=0; i<n; i++) {
       y = (y * x) + c[i]; 
    }
    Each iteration needs the value of y from the previous iteration. I don't think you can parallelize that. But if you rewrote it to compute the terms separately and add, there are opportunities for SIMD there, because the terms don't depend on each other.
    Like this:
    Code:
    for (i=0; i<n; i+=4)
      y = y*x*x*x*x + (c[i]*x*x*x + c[i+1]*x*x) + (c[i+2]*x + c[i+3]);
    where the powers of x (modulo the word size) are constants computed in advance. The 4 multiplications are done in parallel. Then 2 additions are done in parallel in the next step, then finally 2 sequential additions. Of course you can unroll further.
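    Spelling out what "computed in advance" means, the same unrolled loop might be written with the powers hoisted (my restatement of Matt's fragment; n is assumed to be a multiple of 4, and the wrap-around modulo the word size falls out of unsigned arithmetic):

    Code:
    #include <stddef.h>

    /* Horner's recurrence unrolled by 4 with the powers of x precomputed.
       The products y*x4, c[i]*x3, c[i+1]*x2 and c[i+2]*x are mutually
       independent, so they can be evaluated in parallel. */
    unsigned horner4(const unsigned *c, size_t n, unsigned x)
    {
        unsigned x2 = x * x, x3 = x2 * x, x4 = x2 * x2;
        unsigned y = 0;
        for (size_t i = 0; i < n; i += 4)
            y = y * x4 + (c[i] * x3 + c[i+1] * x2) + (c[i+2] * x + c[i+3]);
        return y;
    }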

  14. #14
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    nburns:



    I'm guessing that a whole lot of people have come up with the idea, here and there, and that it goes way, way back. I would guess that people were doing SIMD within registers almost from day one, and vector processing was just adding a little hardware to make it easier.

    It's the kind of thing people are going to think of when they're doing various concrete things in software, like implementing Binary Coded Decimal, and only later throwing a little hardware at it.

    I came up with it as an undergrad trying to come up with a fast implementation of Conway's Game of Life (cellular automaton), before I'd ever heard of vector processing hardware, and later found out it was a standard technique, and still later found out there was hardware like that.
    Yup. That's reinventing the wheel.

    It's sort of implicit in the classic version of Life in Smalltalk where you just use BitBlt (Bit Block Transfer) operations to XOR and AND together a handful of whole bitmaps. (By bitmaps, I mean literal one-bit-per-pixel bitmaps.) And really, it's probably the basic idea underlying decent software (or microcoded) implementations of various things like BitBlt and BCD in the first place.

    I'd be surprised if SIMD-Within-A-Register wasn't more common in the 1950's when memory was so tight and processors were so slow---people often had to pack data pretty tightly, and avoiding unpacking and repacking too much was worth a lot of thought.
    There was doubtless a lot of pressure to be efficient in the 1950s, but the reasons might have been different from today. Packing data tightly made it less likely that you'd pull a muscle when lifting boxes of punch cards, and you'd want to be careful not to waste computation, because vacuum tubes didn't last forever and I'm sure they weren't cheap.
    Last edited by nburns; 16th April 2014 at 23:27.

  15. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Like this:
    Code:
    for (i=0; i<n; i+=4)
      y = y*x*x*x*x + (c[i]*x*x*x + c[i+1]*x*x) + (c[i+2]*x + c[i+3]);
    where the powers of x (modulo the word size) are constants computed in advance. The 4 multiplications are done in parallel. Then 2 additions are done in parallel in the next step, then finally 2 sequential additions. Of course you can unroll further.
    Thanks. That seems to agree with my point that parallelization requires finding instructions (for SIMD, identical instructions) that don't depend on each other.

    The formula for ANS has a few more details thrown in, too, which make it much harder to extract parallelism the way you did here.
