
Thread: Fast [delta] compression of sets with small domains

  1. #1
    Member
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    17
    Thanks
    23
    Thanked 0 Times in 0 Posts

    Fast [delta] compression of sets with small domains

    I'm looking for a fast (especially fast on the decompression side) way to compress sets with small domains. Let me be more concrete: I'd like to compress a set representing some fixed-size subset of all bytes, for example 64 values out of the 256 possible byte values. You can think of it as a set of 64 single-byte characters.

    A simple uncompressed representation would be a 256-bit (32-byte) bitmap, one bit for each possible byte. I think this is already optimal for randomly distributed sets of size 128, since in that case every bit has value 1 with probability 0.5. For smaller sets like 64 or 32 elements, however, it seems like we could do better. Even for 64 randomly distributed elements, though, the entropy per bit is still ~0.81 bits, so 256 * 0.811 = ~208 bits is the best we can do? In that case there doesn't seem to be much low-hanging fruit here.
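    As a sanity check on those numbers, here is a small sketch in C (my own illustration) that computes both the exact subset entropy log2(C(256,k)) and the per-bit bound 256*H(k/256) for a few set sizes; for k=64 it should print roughly 204 vs 208 bits, and for k=128 the two are nearly equal:

    Code:
    /* Entropy check: bits needed to identify a random k-element subset of a
       256-symbol domain, exact (log2 of the binomial) vs. the per-bit bound. */
    #include <math.h>
    #include <stdio.h>

    static double log2_binomial(int n, int k)
    {
        /* log2(n choose k) via lgamma, to avoid overflowing the binomial */
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2.0);
    }

    int main(void)
    {
        int sizes[] = { 32, 64, 128 };
        for (int i = 0; i < 3; i++) {
            int k = sizes[i];
            double p = k / 256.0;
            double per_bit = -p * log2(p) - (1.0 - p) * log2(1.0 - p);
            printf("k=%3d: log2 C(256,k) = %6.1f bits, 256*H(p) = %6.1f bits\n",
                   k, log2_binomial(256, k), 256.0 * per_bit);
        }
        return 0;
    }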

    Set Delta

    So then the other thing I'm interested in is compressing a series of such sets, where the sets are very likely to be similar. Consider, for example, the sets induced by breaking enwik9 into 8 KB chunks and for each chunk creating the set of the 64 most frequent characters in that chunk: on average only ~7.25 elements change from one set to the next. Put another way, the xor of the bitmaps for one set compared to the next generally has only 7.25 * 2 = ~14.5 bits set. Working from the bitmap representation, what's a good way to compress such a bitmap of mostly zeros without using per-bit arithmetic coding or anything per-bit, really?

    I could also abandon the bitmap representation and just explicitly store the lists of indexes that were added and removed, which I guess would be ~14.5 bytes for the indexes plus a couple more bytes to store the sizes of the lists. Another option would be some kind of bit-wise RLE, encoding runs of zeros?
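    For the explicit add/remove idea, the encoder side could be a minimal sketch like the following (my own illustration; the [n_removed][n_added][bytes...] layout and the GCC/Clang builtin are assumptions, names are made up):

    Code:
    /* Sketch: walk the XOR of two 256-bit bitmaps (stored as 4 x uint64_t)
       and emit one byte per changed element, preceded by the two list sizes. */
    #include <stdint.h>
    #include <stddef.h>

    /* Output layout: [n_removed][n_added][removed bytes...][added bytes...].
       Returns the number of bytes written, ~2 + 14.5 on average for the
       enwik9 example above. */
    static size_t encode_set_delta(const uint64_t old_bm[4],
                                   const uint64_t new_bm[4],
                                   uint8_t *out)
    {
        uint8_t removed[256], added[256];
        size_t nr = 0, na = 0;
        for (int w = 0; w < 4; w++) {
            uint64_t diff = old_bm[w] ^ new_bm[w];
            while (diff) {
                int bit = __builtin_ctzll(diff);        /* lowest changed bit */
                uint8_t value = (uint8_t)(w * 64 + bit);
                if (old_bm[w] >> bit & 1)
                    removed[nr++] = value;              /* was set, now cleared */
                else
                    added[na++] = value;                /* newly set */
                diff &= diff - 1;                       /* clear that bit */
            }
        }
        size_t pos = 0;
        out[pos++] = (uint8_t)nr;
        out[pos++] = (uint8_t)na;
        for (size_t i = 0; i < nr; i++) out[pos++] = removed[i];
        for (size_t i = 0; i < na; i++) out[pos++] = added[i];
        return pos;
    }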

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Build a statistical model of the data you are going to compress. Without it, it's impossible to predict which method will be better. Also describe your speed goals. For compressing the bits you can use Huffman encoding of entire bytes with autogenerated Huffman codes.

  3. #3
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    If you know in advance that you need to retrieve 64 or 32 values, you don't need to record the whole 256 bits: just stop once you know all the values, or when the number of remaining unknown values equals the number of remaining bits. This way you'll send between 32 (or 64) and 255 bits (256 is never needed).
    E.g. with 4 values to retrieve out of 8, where 1 marks a member and 0 marks an unused value:
    011011xx (no need to send xx, since four 1s have already been sent)
    00100xxx (no need to send xxx, since the remaining bits and the number of unknown values are both 3)
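    A decoder for this early-stop scheme could look roughly like the sketch below (just an illustration of the idea; the bit reader is a stand-in that reads MSB-first from a byte buffer):

    Code:
    /* Sketch: read raw bits until either all k members have been seen, or the
       number of still-missing members equals the number of remaining positions
       (in which case the rest are implied). */
    #include <stdint.h>
    #include <stddef.h>

    static const uint8_t *bits_src;   /* packed input bits, MSB first */
    static size_t bit_pos;

    static int get_bit(void)
    {
        int b = bits_src[bit_pos >> 3] >> (7 - (bit_pos & 7)) & 1;
        bit_pos++;
        return b;
    }

    /* Fills bitmap[0..255] with membership flags for a k-element subset. */
    static void decode_early_stop(uint8_t bitmap[256], int k)
    {
        int found = 0;
        for (int pos = 0; pos < 256; pos++) {
            int remaining = 256 - pos;
            int missing = k - found;
            if (missing == 0) {
                bitmap[pos] = 0;                 /* rest must be zeros */
            } else if (missing == remaining) {
                bitmap[pos] = 1;                 /* rest must be ones */
                found++;
            } else {
                bitmap[pos] = (uint8_t)get_bit();
                found += bitmap[pos];
            }
        }
    }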

  4. #4
    Member
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    17
    Thanks
    23
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by caveman View Post
    If you know in advance that you need to retrieve 64 or 32 values, you don't need to record the whole 256 bits: just stop once you know all the values, or when the number of remaining unknown values equals the number of remaining bits. This way you'll send between 32 (or 64) and 255 bits (256 is never needed).
    E.g. with 4 values to retrieve out of 8, where 1 marks a member and 0 marks an unused value:
    011011xx (no need to send xx, since four 1s have already been sent)
    00100xxx (no need to send xxx, since the remaining bits and the number of unknown values are both 3)
    That's a good point, although I don't think it will save much in general. For example, even with only 32 bits set out of 256, the last set bit will probably be ~8 positions from the end, so you'll still generally have to transfer ~240+ bits.

    For the special case of characters in enwik9, this may actually work quite well, since characters with the top bit set (> 127) are quite rare in English text.

  5. #5
    Member
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    17
    Thanks
    23
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Build a statistical model of the data you are going to compress. Without it, it's impossible to predict which method will be better. Also describe your speed goals. For compressing the bits you can use Huffman encoding of entire bytes with autogenerated Huffman codes.
    I gave some statistics above for the case of characters in 8 KB chunks of enwik9. In general the chunk-to-chunk difference in the set of the top 64 characters is only ~7 characters. That is, ~57 of the characters in set N+1 are the same as those in set N.

    I'm hoping to decode this in something like a few cycles per set element. In the case where the set is very similar to the last one (or perhaps some earlier one other than the last) it would be nice if the time were proportional to the number of changed elements rather than to the whole set size - and still only a few cycles (say < 10 ns on modern hardware) per changed element.
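    To make that concrete, the decode I have in mind is roughly the following sketch (assuming the [n_removed][n_added][bytes...] layout from the sketch in my first post; the function name is made up). Applying the delta touches only the changed elements, so the cost is proportional to the ~7 changes rather than to the 64-element set:

    Code:
    /* Sketch: apply an [n_removed][n_added][bytes...] delta to the previous
       256-bit bitmap in place; cost is proportional to the number of changes. */
    #include <stdint.h>
    #include <stddef.h>

    static size_t apply_set_delta(uint64_t bitmap[4], const uint8_t *delta)
    {
        size_t nr = delta[0], na = delta[1], pos = 2;
        for (size_t i = 0; i < nr; i++, pos++)
            bitmap[delta[pos] >> 6] &= ~(1ULL << (delta[pos] & 63));   /* clear */
        for (size_t i = 0; i < na; i++, pos++)
            bitmap[delta[pos] >> 6] |=  (1ULL << (delta[pos] & 63));   /* set */
        return pos;   /* bytes consumed */
    }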

  6. #6
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    If there is enough commonality, you could build and keep a table of up to 256 possible lists and specify which list with one byte. For the first list (maybe a few more), you would send the whole bitmap or a guess. After that, add a block's list to the table if it is different enough - perhaps different by at least two or three characters. Then when specifying a bitmap, you would send 1. the table number, 2. the number of differences and 3. the list of differences. If you do it enough times and the data is close enough, you could often get it down to only 3 or 4 bytes to send the list, and it would be easy to process, but it would use more RAM.
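    Roughly, the lookup could be something like this sketch (table management is left out; __builtin_popcountll assumes GCC/Clang, and the names are just illustrative):

    Code:
    /* Sketch: keep up to 256 previously sent bitmaps and find the cached one
       with the fewest differing bits; the encoder then sends that table index,
       the difference count and the changed byte values. */
    #include <stdint.h>

    #define TABLE_SIZE 256

    static uint64_t table[TABLE_SIZE][4];   /* cached 256-bit bitmaps */
    static int table_used = 0;              /* how/when entries are added is elided */

    static int closest_cached(const uint64_t bm[4], int *out_diff_bits)
    {
        int best = 0, best_diff = 257;
        for (int i = 0; i < table_used; i++) {
            int diff = 0;
            for (int w = 0; w < 4; w++)
                diff += __builtin_popcountll(table[i][w] ^ bm[w]);
            if (diff < best_diff) { best_diff = diff; best = i; }
        }
        *out_diff_bits = best_diff;
        return best;
    }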
    Last edited by Kennon Conrad; 5th March 2015 at 13:23.

  7. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    PSHUFB (5th March 2015)

  8. #7
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    1
    Thanks
    0
    Thanked 1 Time in 1 Post
    With sets of 64 values out of 256 with an average change of 7 values between sets:
    -uncompressed full bit map (256 bits) = 32 bytes
    -entropy of full bit map (64 out of 256 bits) = ~204 bits = ~26 bytes

    -count of changes (~6 bits) + ~7 changes * (index to remove (6 bits) + byte to add (8 bits)) = ~104 bits = ~13 bytes.
    -count of changes (~6 bits) + entropy of delta bitmap (14 out of 256 bits) = ~82 bits = ~11 bytes. ~10 bytes when you compress the count.

    In your sample case, using simple delta encoding saves ~50% compared to the compressed full bitmap, and it can still be compressed further.

    Keep the set of your 64 values in an array/list. Do the "index to remove and byte to add" step and always add the new byte either at the start (OR the end). This way you get a move-to-front (OR move-to-end) transform for free, possibly helping the following compression stage.
    The entropy coder should code the counts (high probability of a count around 7; delta encode the counts?), the indexes (high probability of an index near the start (OR the end) of the array; delta encode the indexes?) and the bytes to add (take the global probability of all bytes, excluding the bytes already in the array).

    And for the speed... good luck...
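    The array bookkeeping itself is cheap, though. A sketch of the "remove index, insert at front" step (struct and names are just illustrative):

    Code:
    /* Sketch: the set is an ordered array of 64 bytes; a change removes the
       element at remove_idx and inserts the new byte at the front, so recently
       changed bytes cluster near index 0 (the free move-to-front transform). */
    #include <stdint.h>
    #include <string.h>

    #define SET_SIZE 64

    typedef struct { uint8_t v[SET_SIZE]; } byte_set;

    static void apply_change(byte_set *s, int remove_idx, uint8_t new_byte)
    {
        /* shift elements 0..remove_idx-1 one slot right, overwriting the
           removed element, then drop the new byte into slot 0 */
        memmove(&s->v[1], &s->v[0], (size_t)remove_idx);
        s->v[0] = new_byte;
    }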

  9. The Following User Says Thank You to opi2k2 For This Useful Post:

    PSHUFB (5th March 2015)

  10. #8
    Expert Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Pack all the bitmaps together and apply standard compression to it. Just an order 0 context model will probably do pretty well.
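    A quick way to check that, if anyone wants to try: estimate the order-0 byte entropy of the packed bitmap file. Just a sketch; the file name and framing are assumptions:

    Code:
    /* Sketch: order-0 byte entropy estimate of a file of packed 32-byte bitmaps. */
    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s bitmaps.bin\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        uint64_t freq[256] = { 0 }, total = 0;
        int c;
        while ((c = fgetc(f)) != EOF) { freq[c]++; total++; }
        fclose(f);

        double bits = 0.0;
        for (int b = 0; b < 256; b++)
            if (freq[b])
                bits -= (double)freq[b] * log2((double)freq[b] / (double)total);

        printf("order-0 estimate: %.0f bytes for %llu input bytes\n",
               bits / 8.0, (unsigned long long)total);
        return 0;
    }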

  11. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    PSHUFB (5th March 2015)

  12. #9
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Pack all the bitmaps together and apply standard compression to it. Just an order 0 context model will probably do pretty well.
    I was curious, so I put the bitmaps in a file and tried fpaq. If I understand correctly, fpaq is an order 0 arithmetic coder that works on bytes rather than bits. Bitmaps for 8K byte sections of enwik9 (as suggested by the OP) compressed from 4,000,000 bytes to 1,742,596. So apparently working on bytes here gives a better result. Then I ran fpaq on the deltas and got 1,101,072 bytes, or an average of 8.81 bytes per 32 byte bitmap.
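    For anyone who wants to reproduce this, the test files can be generated with something like the sketch below (an assumption about the setup, not the code actually used): split the input into 8 KB chunks, mark each chunk's 64 most frequent byte values in a 32-byte bitmap, and also write the XOR against the previous chunk's bitmap as the delta stream.

    Code:
    /* Sketch: produce bitmaps.out (32 bytes per 8 KB chunk) and deltas.out
       (XOR with the previous chunk's bitmap) from an input file like enwik9. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNK 8192

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s input bitmaps.out deltas.out\n", argv[0]);
            return 1;
        }
        FILE *in = fopen(argv[1], "rb");
        FILE *bm = fopen(argv[2], "wb");
        FILE *dl = fopen(argv[3], "wb");
        if (!in || !bm || !dl) { perror("fopen"); return 1; }

        static uint8_t buf[CHUNK];
        uint8_t prev[32] = { 0 };
        size_t n;
        while ((n = fread(buf, 1, CHUNK, in)) > 0) {
            uint32_t freq[256] = { 0 };
            for (size_t i = 0; i < n; i++) freq[buf[i]]++;

            uint8_t bitmap[32] = { 0 };
            for (int k = 0; k < 64; k++) {              /* pick the top 64 bytes */
                int best = -1;
                for (int b = 0; b < 256; b++) {
                    int taken = bitmap[b >> 3] >> (b & 7) & 1;
                    if (!taken && freq[b] > 0 && (best < 0 || freq[b] > freq[best]))
                        best = b;
                }
                if (best < 0) break;                    /* fewer than 64 distinct bytes */
                bitmap[best >> 3] |= (uint8_t)(1u << (best & 7));
            }

            uint8_t delta[32];
            for (int i = 0; i < 32; i++) delta[i] = bitmap[i] ^ prev[i];
            fwrite(bitmap, 1, 32, bm);
            fwrite(delta, 1, 32, dl);
            memcpy(prev, bitmap, 32);
        }
        fclose(in); fclose(bm); fclose(dl);
        return 0;
    }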
    Last edited by Kennon Conrad; 6th March 2015 at 21:14. Reason: typos

  13. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Matt Mahoney (6th March 2015)

