+ Reply to Thread
Page 2 of 2 FirstFirst 12
Results 31 to 34 of 34

Thread: How fast should be a range coder ?

  1. #31
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    400
    Hello

    Not being satisfied enough with RC decompression speed, i had a try at my first Huffman encoder.

    The results are interesting, and i would like to share a few counter-intuitive conclusions from this experiment.
    First, the file :
    http://img46.xooimage.com/files/2/4/3/huff0-155f0ef.zip

    On the performance side, compression speed reaches 200MB/s.
    And decoding speed is about 160MB/s.
    Technically, this is a semi-static implementation, with data chunked into blocks of 32KB. Each block therefore has a header containing a weight-table.

    Although this is quite faster than RC, i'm feeling there are still questions opened.
    1) Decoder is slower than Encoder.
    This one is a surprise.
    Both encoder and decoder are relatively simple and behave nearly the same.
    Encoder, however, needs also to produce the huffman tree, a task which consumes significant CPU share (200MB/s ==> 6500 trees / s).
    But nonetheless, the decoder, which does not need this part, is slower.

    The only explanation i could come up with is the potential impact of decoding table.
    Although it is quite small (64KB) for an L2 cache, it is quite big for an L1 cache.
    Therefore, i have to expect that many cache hit will be in L2 rather than L1.

    But is that really satisfying enough as an answer ?
    After all, LZP2 uses a decoding table (pointers) of 256KB which fits into L2 rather than L1. But it still decodes much faster (>300MB/s).

    2) Huffman encoder compress better than RC Encoder.
    This shouldn't be, as RC precision is much higher than Huffman.

    In fact, this one is easier to explain.
    Indeed, if you compare huff0 and range0 on some test files, you'll notice Huff0 compress slightly better than Range0 on relatively "regular" data (enwik), and quite much better on large binary data with a lot of pattern change.

    This is all about pattern update. Because the Huffman Tree header is much smaller than the frequency count header of range0, it is possible to lower the size of each block, improving pattern adaptation.

    That's a side effect of semi-static implementation comparison.
    I don't expect this conclusion to hold true for dynamic implementations.


    Regards
    Last edited by Cyan; 15th November 2009 at 21:04.

  2. #32
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    1,895
    > Not being satisfied enough with RC decompression speed,
    > i had a try at my first Huffman encoder.

    As I said, there're ways to improve the rc speed, especially
    when its static. In fact, nobody uses a plain arithmetic coder
    with multiplications and division for codecs like jpeg and h264.

    > 1) Decoder is slower than Encoder.
    > This one is a surprise.

    Not really, its always true for near-symmetric compression schemes.
    The problem is caused by sequential nature of the decoder - you
    have to decode one symbol to do something about the next one.
    So the decoder gets stuck in the symbol dependency chain, while
    in encoder symbols don't depend on any computations.

    Then, another problem is related to the common read/write asymmetry.
    Writing is usually slower than reading, and decoder has to write
    more data than it reads - due to the nature of the algorithm.

    > Although it is quite small (64KB) for an L2 cache, it is
    > quite big for an L1 cache.

    Yeah, that matters too. For example, you can try sorting
    the alphabet in such a way that symbols with high frequencies
    would be clustered together - both for rc and huffman.
    Also, using a smaller lookup table with additional lookups for
    less frequent symbols might be faster despite being more complex.

    > After all, LZP2 uses a decoding table (pointers) of 256KB
    > which fits into L2 rather than L1. But it still decodes much
    > faster (>300MB/s).

    Don't forget about the compression ratio too.
    Like, try comparing the speed of your implementations with
    random data and with a zero-filled file.

    > 2) Huffman encoder compress better than RC Encoder.
    > This shouldn't be, as RC precision is much higher than Huffman.

    Yeah, if only you could implement it the same way.
    Its just wrong to compare the RC with 512k blocks and huffman with 32k blocks.

    > Because the Huffman Tree header is much smaller than the
    > frequency count header of range0, it is possible to lower
    > the size of each block, improving pattern adaptation.

    That's not true actually. Its possible to efficiently encode
    both huffman trees and frequency tables.

    > That's a side effect of semi-static implementation comparison.
    > I don't expect this conclusion to hold true for dynamic implementations.

    Its not really true for static either.
    But huffman might be really a better choice for speed-oriented
    implementations, as it only adds 3-5% redundancy on average,
    but works faster.

    But on other hand, speed difference is not that big, but
    arithmetic coding is much more flexible.
    For example, bitcode schemes in popular codecs (mp3,aac,jpeg etc)
    are too complex to explain in a few words, while even a trivial
    statistical model (like order0) with adaptive rc would be much simpler
    and might even provide better compression (in fact, maybe even speed).

    Also, blockwise coding is inconvenient for many applications
    (like network protocols), while adaptive bitcode schemes are
    much slower than rc.

    Anyway, I really wonder how are you going to apply a compression
    algorithm with >100MB/s processing speed - in most cases that
    would be impossible to reach due to i/o speed limits.

  3. #33
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    400
    Actually I already mentioned one way to improve the rangecoding speed,
    which is especially applicable when its based on a static model.
    You can just use multiple rangecoders at once.
    Encode symbol c[0] with rc[0], then c[1] with rc[1], ... then c[N] with rc[0] again.
    It can be even vectorized to some extent.
    Do you mean actually multi-threading ?

  4. #34
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    1,895
    No, parallel rangecoding works faster even within a single thread.
    As I said, its also possible even to vectorize the processing - eg.
    IntelC is able to compile something like
    for( i=0; i<8; i++) rc[i].Renorm();
    to a bunch of SSE2 instructions.
    But even without that its always good to interleave the independent
    code sequences.

    That's just one possibility though - there're quite a few other things.
    Like adding some symbol ranking and switching to bitwise coding
    (which only requires a single multiplication and no divisions - even
    for decoding). Or working with rc state in logarithmic domain - where
    rnew = (range/total)*freq;
    would become
    rnew = range - total + freq;
    well, quite a few lookup tables would be necessary, but its still an option.

+ Reply to Thread
Page 2 of 2 FirstFirst 12

Similar Threads

  1. M1 - Optimized demo coder
    By toffer in forum Data Compression
    Replies: 189
    Last Post: 22nd July 2010, 00:49
  2. A weird order8(?) CM coder
    By Shelwien in forum Data Compression
    Replies: 1
    Last Post: 23rd July 2009, 22:08
  3. flzp_ac2 (flzp + an order-2 arithmetic coder)
    By inikep in forum Data Compression
    Replies: 4
    Last Post: 25th June 2008, 22:37
  4. LZC - new fastr LZ coder compressor -
    By Nania Francesco in forum Forum Archive
    Replies: 70
    Last Post: 16th November 2007, 15:04

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts