Hello
Not being satisfied enough with RC decompression speed, i had a try at my first Huffman encoder.
The results are interesting, and i would like to share a few counter-intuitive conclusions from this experiment.
First, the file :
http://img46.xooimage.com/files/2/4/3/huff0-155f0ef.zip
On the performance side, compression speed reaches 200MB/s.
And decoding speed is about 160MB/s.
Technically, this is a semi-static implementation, with data chunked into blocks of 32KB. Each block therefore has a header containing a weight-table.
Although this is quite faster than RC, i'm feeling there are still questions opened.
1) Decoder is slower than Encoder.
This one is a surprise.
Both encoder and decoder are relatively simple and behave nearly the same.
Encoder, however, needs also to produce the huffman tree, a task which consumes significant CPU share (200MB/s ==> 6500 trees / s).
But nonetheless, the decoder, which does not need this part, is slower.
The only explanation i could come up with is the potential impact of decoding table.
Although it is quite small (64KB) for an L2 cache, it is quite big for an L1 cache.
Therefore, i have to expect that many cache hit will be in L2 rather than L1.
But is that really satisfying enough as an answer ?
After all, LZP2 uses a decoding table (pointers) of 256KB which fits into L2 rather than L1. But it still decodes much faster (>300MB/s).
2) Huffman encoder compress better than RC Encoder.
This shouldn't be, as RC precision is much higher than Huffman.
In fact, this one is easier to explain.
Indeed, if you compare huff0 and range0 on some test files, you'll notice Huff0 compress slightly better than Range0 on relatively "regular" data (enwik), and quite much better on large binary data with a lot of pattern change.
This is all about pattern update. Because the Huffman Tree header is much smaller than the frequency count header of range0, it is possible to lower the size of each block, improving pattern adaptation.
That's a side effect of semi-static implementation comparison.
I don't expect this conclusion to hold true for dynamic implementations.
Regards

Reply With Quote