# Thread: Fast [delta] compression of sets with small domains

1. ## Fast [delta] compression of sets with small domains

I'm looking for a fast (especially on the decompression side) way to compress sets drawn from small domains. To be more concrete: I'd like to compress a set representing some fixed-size subset of all bytes, for example 64 values out of the 256 possible byte values. You can think of it as a set of 64 single-byte characters.

A simple uncompressed representation would be a 256-bit (32 byte) bitmap, one bit for each possible byte. I think this is already optimal for randomly distributed sets of size 128, since in that case every bit is 1 with probability 0.5. For smaller sets of 64 or 32 elements, however, it seems like we could do better. Even for 64 randomly distributed elements, the entropy per bit is still ~0.811 bits, so 256 × 0.811 ≈ 208 bits is about the best we can do? In that case it doesn't seem like there is much low-hanging fruit here.
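To sanity-check those numbers, here's a quick sketch; the exact optimum is log2 C(256, 64), which comes out a few bits under the per-bit bound:

```python
from math import comb, log2

# Exact cost of an ideal code for a 64-element subset of a 256-value domain:
# there are C(256, 64) such subsets, so an ideal code needs log2 of that.
exact_bits = log2(comb(256, 64))          # ~203.6 bits

# Per-bit entropy bound: each bitmap bit is 1 with probability 64/256 = 0.25.
p = 64 / 256
h = -p * log2(p) - (1 - p) * log2(1 - p)  # ~0.811 bits per bitmap bit
bound_bits = 256 * h                      # ~207.7 bits
```

The gap between the two is the cost of coding each bit independently instead of exploiting the fact that the total count is known to be exactly 64.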

**Set delta**

So then the other thing I'm interested in is compressing a series of such sets, where consecutive sets are very likely to be similar. Consider, for example, the sets induced by breaking enwik9 into 8 KB chunks and, for each chunk, creating the set of its 64 most frequent characters: on average only ~7.25 elements change from one set to the next. Said another way, the XOR of one set's bitmap with the next generally has only 7.25 × 2 = ~14.5 bits set. Working from the bitmap representation, what's a good way to compress such a mostly-zero bitmap without per-bit arithmetic coding, or anything per-bit, really?

I could also abandon the bitmap representation and just explicitly store the lists of indexes that were added and removed, which I guess would be ~14.5 bytes for the indexes plus a couple more bytes to store the lengths of the lists. Another option would be some kind of bit-wise RLE, encoding the runs of zeros?
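The explicit add/remove-list idea could look like this minimal sketch; the wire format (one length byte per list, then the raw indexes) is just an assumption for illustration:

```python
def set_delta(prev: set, curr: set) -> bytes:
    """Encode curr relative to prev as (added, removed) byte lists."""
    added = sorted(curr - prev)
    removed = sorted(prev - curr)
    # Hypothetical wire format: one length byte per list, then raw indexes.
    return (bytes([len(added)]) + bytes(added)
            + bytes([len(removed)]) + bytes(removed))

def apply_delta(prev: set, blob: bytes) -> set:
    n_add = blob[0]
    added = blob[1:1 + n_add]
    n_rem = blob[1 + n_add]
    removed = blob[2 + n_add:2 + n_add + n_rem]
    return (prev - set(removed)) | set(added)
```

For the ~7.25-change average above, this is roughly 2 + 14.5 ≈ 16 bytes per set before any further entropy coding.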

2. Build the stat model of the data you are going to compress; without it, it's impossible to predict which method will be better. And describe your speed goals. For compression of bits you can use Huffman encoding of entire bytes with autogenerated Huffman codes.

3. If you know in advance that you need to retrieve 64 or 32 values, you don't need to record the whole 256 bits: just stop once you know all the values, or once the number of remaining unknown values equals the number of remaining bits. This way you'll send between 32 (or 64) and 255 bits (256 is never needed).
E.g. 4 values to retrieve out of 8, where 1 marks a used value and 0 an unused one:
011011xx (no need to send xx, since four 1s have already been sent)
00100xxx (no need to send xxx, since the remaining bits and the number of unknown values are both 3)

4. Originally Posted by caveman
If you know in advance that you need to retrieve 64 or 32 values, you don't need to record the whole 256 bits: just stop once you know all the values, or once the number of remaining unknown values equals the number of remaining bits. This way you'll send between 32 (or 64) and 255 bits (256 is never needed).
E.g. 4 values to retrieve out of 8, where 1 marks a used value and 0 an unused one:
011011xx (no need to send xx, since four 1s have already been sent)
00100xxx (no need to send xxx, since the remaining bits and the number of unknown values are both 3)
That's a good point, although I don't think it will save much in general. For example, even with only 32 bits set out of 256, the last set bit will probably be only ~8 bits from the end, so you'll still generally have to transfer ~240+ bits.

For the special case of characters in enwik9, though, this may actually work quite well, since characters with the top bit set (> 127) are quite rare in English text.

5. Originally Posted by Bulat Ziganshin
Build the stat model of the data you are going to compress; without it, it's impossible to predict which method will be better. And describe your speed goals. For compression of bits you can use Huffman encoding of entire bytes with autogenerated Huffman codes.
I gave some statistics above for the case of characters in 8 KB chunks of enwik9. In general the chunk-to-chunk difference in the set of the top 64 characters is only ~7 characters; that is, ~57 of the characters in set N+1 are the same as those in set N.

I'm hoping to decode this in something like a few cycles per set element. In the case where a set is very similar to the previous one (or perhaps to some earlier one), it would be nice if the time were proportional to the number of changed elements rather than to the whole set size, and still a few cycles (say < 10 ns on modern hardware) per changed element.

6. If there is enough commonality, you could build and keep a table of up to 256 possible lists and specify which list with one byte. For the first list (maybe a few more), you would send the whole bitmap, or guess. After that, add a block's list to the table if it differs from everything already there, perhaps by at least two or three characters. Then, when specifying a bitmap, you would send 1) the table number, 2) the number of differences and 3) the list of differences. If you do it enough times and the data is close enough, you could often get it down to only 3 or 4 bytes to send the list, and it would be easy to process, but it would use more RAM.
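One possible shape for this table scheme, as a rough sketch; the three-byte header, the keep-the-first-256 replacement policy, and the assumption that both sides share the initial set are mine, not part of the proposal:

```python
class SetTableCodec:
    """Sketch of the table idea above: cache previously seen sets and
    send each new set as (table index, added bytes, removed bytes).
    Encoder and decoder update their tables identically, staying in sync."""
    MAX_ENTRIES = 256  # index must fit in one byte

    def __init__(self, initial: set):
        self.table = [frozenset(initial)]  # assumed known to both sides

    def encode(self, s: set) -> bytes:
        s = frozenset(s)
        # Pick the cached set with the smallest symmetric difference.
        idx = min(range(len(self.table)), key=lambda i: len(self.table[i] ^ s))
        base = self.table[idx]
        added, removed = sorted(s - base), sorted(base - s)
        if len(self.table) < self.MAX_ENTRIES:
            self.table.append(s)
        return bytes([idx, len(added), len(removed)]) + bytes(added) + bytes(removed)

    def decode(self, blob: bytes) -> set:
        idx, n_add, n_rem = blob[0], blob[1], blob[2]
        added = set(blob[3:3 + n_add])
        removed = set(blob[3 + n_add:3 + n_add + n_rem])
        s = (self.table[idx] - removed) | added
        if len(self.table) < self.MAX_ENTRIES:
            self.table.append(frozenset(s))
        return set(s)
```

With ~7 changes per set this is a 3-byte header plus ~14 index bytes, in the same ballpark as the ~13-byte estimate given below.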

7. ## The Following User Says Thank You to Kennon Conrad For This Useful Post:

PSHUFB (5th March 2015)

8. With sets of 64 values out of 256 and an average of 7 values changing between sets:
-uncompressed full bitmap (256 bits) = 32 bytes
-entropy of full bitmap (64 out of 256 bits) = ~204 bits = ~26 bytes

-count of changes (~6 bits) + ~7 × (index to remove (6 bits) + byte to add (8 bits)) = ~104 bits = ~13 bytes
-count of changes (~6 bits) + entropy of delta bitmap (14 out of 256 bits) = ~82 bits = ~11 bytes, or ~10 bytes if you compress the count too.

In your sample case, simple delta encoding saves ~50% compared to the compressed full bitmap, and it can still be compressed further.

Keep your set of 64 values in an array/list. Do the "index to remove, byte to add" updates, and always add the new byte at the start (or the end). This way you get a move-to-front (or move-to-end) transform for free, possibly helping the following compression stage.
The entropy coder should code the counts (high probability of a count around 7; delta-encode the counts?), the indexes (high probability of an index near the start (or end) of the array; delta-encode the indexes?) and the bytes to add (use the global probabilities of all bytes, excluding the bytes already in the array).

And for the speed... good luck...
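The array-maintenance step above might be sketched as follows (a plain Python list, nothing optimized for speed):

```python
def apply_changes_mtf(arr: list, changes) -> list:
    """arr is the current set as a list of bytes; changes is a sequence of
    (index_to_remove, byte_to_add) pairs. New bytes always go to the front,
    giving a move-to-front effect: recently added bytes sit at small
    indexes, so removing them again later encodes as a small number."""
    for idx, byte in changes:
        del arr[idx]          # remove by position, not by value
        arr.insert(0, byte)   # move-to-front insertion
    return arr
```

Note that the removal indexes refer to positions in the *current* array state, so the decoder must apply the pairs in the same order the encoder emitted them.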

9. ## The Following User Says Thank You to opi2k2 For This Useful Post:

PSHUFB (5th March 2015)

10. Pack all the bitmaps together and apply standard compression to it. Just an order 0 context model will probably do pretty well.
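As a quick way to gauge what an order-0 coder could achieve on the packed bitmaps, here's a small order-0 byte-entropy estimator (an estimate of the ideal coded size, not an actual coder):

```python
from collections import Counter
from math import log2

def order0_bits(data: bytes) -> float:
    """Order-0 (byte-wise) entropy of data in bits: the size an ideal
    order-0 byte coder would approach, ignoring model/adaptation costs."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * log2(c / n) for c in counts.values())
```

Running this over the concatenated bitmaps (or their XOR deltas) gives a lower bound to compare against the fpaq results reported below.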

11. ## The Following User Says Thank You to Matt Mahoney For This Useful Post:

PSHUFB (5th March 2015)

12. Originally Posted by Matt Mahoney
Pack all the bitmaps together and apply standard compression to it. Just an order 0 context model will probably do pretty well.
I was curious, so I put the bitmaps in a file and tried fpaq. If I understand correctly, fpaq is an order-0 arithmetic coder that works on bytes rather than bits. Bitmaps for 8 KB sections of enwik9 (as suggested by the OP) compressed from 4,000,000 bytes to 1,742,596 bytes, so apparently working on bytes gives a better result here. Then I ran fpaq on the deltas and got 1,101,072 bytes, or an average of 8.81 bytes per 32-byte bitmap.

13. ## The Following User Says Thank You to Kennon Conrad For This Useful Post:

Matt Mahoney (6th March 2015)
