
Thread: Fast LZ4+EC compressor

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    Fast LZ4+EC compressor

    A few months ago I started the thread http://encode.ru/threads/1824-asm-Jumpless-bitcodec . Later, Cyan published his extremely fast ANS coder employing the same idea of jumpless bit I/O. It even outperformed known Huffman implementations, and I believe it's fast because it's jumpless, not because it's ANS: a Huffman encode is just "putbits(hufcode[i], hufbits[i])", which is simpler than any ANS operation.
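
    For illustration, here is a minimal C sketch of such a jumpless putbits (the bit-buffer layout and the names are mine, not from any particular codec; it assumes a little-endian target and at most 32 bits per call):

    Code:
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t bit_buf;    /* pending bits, LSB first                */
        unsigned bit_count;  /* number of valid bits in bit_buf (< 8)  */
        uint8_t *out;        /* current output position                */
    } BitWriter;

    /* putbits(code, nbits): no conditional jumps - just shifts, an OR and an unaligned store */
    static inline void putbits(BitWriter *bw, uint32_t code, unsigned nbits)
    {
        bw->bit_buf   |= (uint64_t)code << bw->bit_count;   /* append the new bits        */
        bw->bit_count += nbits;
        memcpy(bw->out, &bw->bit_buf, 8);                   /* always store 8 bytes       */
        bw->out       += bw->bit_count >> 3;                /* advance by whole bytes     */
        bw->bit_buf  >>= bw->bit_count & ~7u;               /* drop the bytes just stored */
        bw->bit_count &= 7;
    }

    /* a Huffman encode of one symbol is then simply: putbits(bw, huf_code[sym], huf_bits[sym]); */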

    So I believe that the fastest EC for LZ would be a jumpless Huffman. Also, as I said in the jumpless-bitcodec thread, interleaving the LZ and EC parts would allow us to hide some cache latencies - we can perform the Huffman encoding while waiting for the LZ match to arrive from the cache.

    The slow part of Huffman is the tree rebuild, but even if that can't be optimized, we can simply perform it less frequently - that should be fine for a speed-optimized compressor.

    Tornado supports both ari (-c4) and huf (-c3) backends, so you can compare their compression ratios. The difference is about 0.5% on binaries and none on text - although that may be an effect of the encoding scheme used, i.e. a single EC-coded number representing both length and distance (i = code(length)*32 + code(distance), 0 <= i < 512).
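
    A rough C sketch of that combined code (the slot functions below are simplistic placeholders - tornado's real tables are tuned differently):

    Code:
    #include <stdint.h>

    /* crude log2-style slots; a real codec uses tuned lookup tables */
    static unsigned bsr32(uint32_t x) { unsigned r = 0; while (x >>= 1) r++; return r; }

    static unsigned len_slot (uint32_t len)  { return bsr32(len);  }   /* assumed 0..15 */
    static unsigned dist_slot(uint32_t dist) { return bsr32(dist); }   /* assumed 0..31 */

    /* one EC symbol carries both the length class and the distance class */
    static unsigned match_symbol(uint32_t len, uint32_t dist)
    {
        unsigned i = len_slot(len) * 32 + dist_slot(dist);   /* 0 <= i < 512 */
        return i;   /* the low bits of len/dist are then sent as raw extra bits */
    }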
    Last edited by Bulat Ziganshin; 29th May 2014 at 22:47.

  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    ANS has the advantage of being online. This means that it scales down to very small buffers. There's plenty of use for compression of buffers between 4 KB and 2 MB, and at the lower end of this range Huffman wouldn't be able to compete.

    Also, I notice that everything in your post is about 'LZ', while the title mentions 'LZ4'.
    Do you think that LZ4 is a good starting point?

    Myself, I think it is, though I'd rather start from scratch, with a clean and simple code base, effectively sacrificing some initial performance and portability for ease of understanding and modification.

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I haven't studied any other ANS codecs, but at least FSE is offline - it builds the compression tables first. Well, the same is true for Huffman: you either perform two passes over the data, or use a semi-adaptive approach where old stats are used to encode new data. So the only real difference is that FSE's encoding and decoding directions differ. It may be that FSE builds its tables faster than Huffman does.

    I expect that an implementation using a single pass over the data will allow interleaving the EC coding with the hash/buffer access latencies, thus improving overall speed.

    Yes, I thought about LZ4 since it's the fastest LZSS codec available. Once again, in my thread I proposed some SSE code to quickly find LZ matches; I believe it will be faster than LZ4's pure C++ approach.
    Last edited by Bulat Ziganshin; 30th May 2014 at 20:12.

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Let's have some fun! For simplicity, I will write AVX2 x64 code.

    The first part is the match search code, copied from here:
    Code:
    loop:
    MOV data, [data_ptr]
    IMUL hash, data, 123456791
    SHR hash, 20
    MOV match_ptr, [hash_table + hash*4]
    MOV [hash_table + hash*4], data_ptr
    
    CMP data, [match_ptr]
    JE match_found           ; 16 ticks delay = load+imul+shr.i+load+load,  but only 3 ariops
    
    no_match:
    ADD data_ptr, 1
    JMP loop
    
    match_found:
    ; compute match length
    VMovDqU  ymm0, [data_ptr+4]
    VPCmpEqB ymm0, ymm0, [match_ptr+4]   ; VEX-encoded forms are needed for ymm operands
    VPMovMskB match_len, ymm0
    NOT match_len
    BSF match_len, match_len
    JZ all_32_bytes_are_equal
    
    ADD data_ptr, match_len
    JMP loop
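
    For reference, here is roughly the same search step in C (a sketch only: the 12-bit hash and the multiplier follow the asm above, but hash_table holds plain pointers here and there is no window or buffer-end handling):

    Code:
    #include <stdint.h>
    #include <string.h>

    #define HASH_BITS 12
    #define HASH(x)   (((uint32_t)(x) * 123456791u) >> 20)    /* same multiplier/shift as above */

    static const uint8_t *hash_table[1 << HASH_BITS];

    /* returns the match length (0 = emit a literal); on success *pmatch points at the match */
    static unsigned find_match(const uint8_t *p, const uint8_t **pmatch)
    {
        uint32_t data, at_match;
        memcpy(&data, p, 4);                       /* MOV data, [data_ptr]           */
        uint32_t h = HASH(data);                   /* IMUL + SHR                     */
        const uint8_t *match = hash_table[h];      /* MOV match_ptr, [hash_table...] */
        hash_table[h] = p;                         /* MOV [hash_table...], data_ptr  */

        if (!match) return 0;
        memcpy(&at_match, match, 4);
        if (at_match != data) return 0;            /* the JE match_found test        */

        /* match length beyond the first 4 bytes; the asm compares 32 bytes at once
           with AVX2 - this byte loop is the scalar equivalent (and ignores the
           buffer end, so it is a sketch only)                                       */
        unsigned len = 4;
        while (p[len] == match[len]) len++;
        *pmatch = match;
        return len;
    }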

    Of course, those "ADD data_ptr" instructions should be preceded by the actual Huffman encoding:

    Code:
    ;; ENCODE_CHAR (*data_ptr)
    ;; 10 ariops, 14 fused mops total
    
    ; putbits (huf_code[ch], huf_bits[ch])
    MOVZX ch, [data_ptr]
    MOV ch_code, huf_code[ch]
    SHLX ch_code, ch_code, out_bits
    OR [out_ptr], ch_code
    
    ; (out_ptr*8+out_bits) += huf_bits[ch]
    ADD out_bits, huf_bits[ch]
    MOV out_shift, out_bits
    AND out_bits, 7
    SHR out_shift, 3
    ADD out_ptr, out_shift
    
    ; update huffman stats
    INC huf_counter[ch]
    INC huf_total
    JZ build_huf_table

    Code:
    ;; ENCODE_MATCH (match_len, data_ptr-match_ptr)
    ;; 18 ariops, 26 fused mops total
    
    MOV len_code, len_code_table[match_len]
    MOV dist, data_ptr
    SUB dist, match_ptr
    BSR dist_code, dist
    LEA ch, [len_code+dist_code+256]
    MOV code, huf_code[ch]
    MOV bits, huf_bits[ch]
    
    MOV extra_bits, len_bits_table[match_len]
    SHLX extra_bits, extra_bits, dist_code
    ADD extra_bits, dist
    SHLX extra_bits, extra_bits, bits
    ADD code, extra_bits
    ADD bits, dist_code
    ADD bits, len_xbits_table[match_len]
     
    ; putbits (code, bits)
    SHLX code, code, out_bits
    OR [out_ptr], code
    
    ; (out_ptr*8+out_bits) += bits
    ADD out_bits, bits
    MOV out_shift, out_bits
    AND out_bits,7
    SHR out_shift, 3
    ADD out_ptr, out_shift
    
     ; update huffman stats
    INC huf_counter[ch]
    INC huf_total
    JZ build_huf_table

    And of course, these 14/26 mops can be nicely interleaved with that 16-tick delay leading up to the "JE match_found". We just need two different match-search code blocks - one interleaved with ENCODE_CHAR and another interleaved with ENCODE_MATCH.

    And since 80% of match checks fail, we can further improve performance by prefetching [hash_table] and [match_ptr] for data_ptr+1 and maybe data_ptr+2. The main restriction is register pressure, so x64 code may be much faster than x86 code.

    I've also found that trick - if it's true, then we can use 64-bit code in 32-bit Windows programs!
    Last edited by Bulat Ziganshin; 1st June 2014 at 16:00.

  5. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Cyan (31st May 2014)

  6. #5
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    I initially thought that EC meant "Error Correction" on reading the title.
    But from reading the posts, that's Entropy Coding...

    I do indeed plan to release such a codec later this year, but that shouldn't stop you from experimenting with some useful ideas of your own.

    Due to an ongoing house move, very little time remains, and all of it is currently devoted to the new LZ4 streaming mode. I believe whatever is learned there is likely to be useful for the next coder too, so it's the right time to "complete" LZ4.

    I guess you already know that Zhuff was initially planned to be this next compressor, but a major redesign will be necessary to reach that goal. As a result of interacting with programmers who use LZ4, I've become obsessed with memory-management capabilities: a coder that uses temporary buffers and handles its own memory behind the programmer's back is no longer acceptable. I'm targeting single-pass decoding, which is not currently possible, and that's the major part of the redesign to complete before a release. It will also settle the compressed format.

    But once that's done, only the foundations will be there. There are a lot of other ideas to be included, and I believe there is ample room for multiple programmers to handle them properly.

    Quote Originally Posted by Bulat Ziganshin View Post
    The difference is about 0.5% on binaries and none on text - although that may be an effect of the encoding scheme used,
    I believe you are right.
    One of the side-effects of Huffman is to introduce distortion (hence less compression) whenever a symbol's probability is both large and far from a clean power of 2. The work-around to this unwanted side-effect is to introduce many symbols, so that each of them is allocated a relatively small probability, keeping the distortion low (since the distance to the closest power of 2 also remains small). That's basically what tornado is doing (from reading your description), and that's definitely what brotli is doing.

    This work-around is no longer required with an arithmetic/ANS coder.
    There is no such "distortion" if a symbol has a probability of, for example, 35% or 75%.
    So we can keep the number of symbols to decode "low".
    Fewer symbols also translates into smaller table sizes and a smaller header description.
    Both effects are important, and should be taken into consideration to fully benefit from arithmetic/ANS.
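
    A quick worked example of that distortion (my numbers, for illustration): for a two-symbol source with probabilities 0.75 and 0.25, a prefix code has to spend exactly 1 bit on each symbol, while the entropy is 0.75*lg(1/0.75) + 0.25*lg(1/0.25) \approx 0.811 bit/symbol - so Huffman loses roughly 23%. An arithmetic/ANS coder can spend \approx 0.415 bits on the frequent symbol and 2 bits on the rare one, approaching 0.811.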

    By applying a Huffman-optimized encoding scheme (with a large number of states) to an arithmetic/ANS encoder, a large part of the advantage is lost in the process.
    This was, in essence, the flaw in the brotli team's analysis of ANS earlier this year (brotli handles more than 1000 symbols!)
    Last edited by Cyan; 1st June 2014 at 22:09.

  7. The Following User Says Thank You to Cyan For This Useful Post:

    Paul W. (1st June 2014)

  8. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Comparison of the cabarc encoding scheme (len+dist combined) to the usual deflate scheme:

    1. It requires only one EC encode/decode operation per match and needs bookkeeping for only one table.
    2. It improves compression by exploiting correlations between length and distance.
    3. Since a practical cabarc scheme uses only 8*16 .. 16*32 len*dist codes, while deflate uses 32+32 (which may be further increased), deflate compresses each individual code more accurately. AFAIK the net effect is that the deflate scheme gives slightly better compression for text files, while cabarc wins on binaries.
    4. It has larger tables and therefore slower table construction, and it needs larger blocks to gather adequate statistics, or longer headers.

    Overall, we have plenty of choices - huf/fse, static/dynamic, cabarc/deflate, C++/asm/SSE - and lots of metrics to measure.

  9. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Cyan (1st June 2014)

  10. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    The entire compression loop is better represented in the following way:

    Code:
    encode match
    prepare match search cycle
    
    while match not found
      encode char
      prepare next match check
    
    //match_found
    find match len
    go to start

    Even with the hash table cached, we should spend ~20 ticks (load+imul+shr.i+load.L2) before we can start checking the first match. That should be enough time to execute the 26 fused mops of the interleaved ENCODE_MATCH, plus the 5 mops of match-check loading, plus preparation of the future match checks and the match-length check. Still, a lot of execution resources will remain unused during this time.

    Once the loop is running, we can reduce delays to a minimum by prefetching data for future matches. And since the final load.L2 is so slow, we need to fully compute the next index ahead of time. The 14 mops required for ENCODE_CHAR plus the 5 mops required for the match check can be executed in 5 cycles, so we can execute one loop step every 5 ticks if full prefetching is used. It's only a question of having enough registers.

    Once a match is found, we take a mispredicted branch, spending the next ~14 ticks in a queue-refill delay. To fill this delay we can prefetch 32 bytes at [match_ptr] - provided it doesn't slow down the main loop. Then we spend at least 6 ticks on the 4 mops of the find_match_len code, and finally we are ready to encode the match and start the next search. So once a match is found, we execute only ~4 mops in ~20 ticks.

    So we have 3 different code parts:
    1. The pre-loop part executes ~40 mops in 20 cycles, utilizing ~50% of Haswell's EUs.
    2. The loop part, with deep pipelining, executes ~20 mops in 5 cycles, utilizing 100% of the EUs.
    3. The post-loop part executes ~4 mops in 20 cycles, utilizing ~5% of the EUs.

    My benchmarks show that on binary files we encode 4-5 chars per match, so overall EU utilization will be ~50%. This means that with HT, two compression threads may utilize one core at about 100%. On text files I've seen a 2:1 chars/matches ratio, so a single compression thread should utilize only about 40% of the EUs.
    Last edited by Bulat Ziganshin; 1st June 2014 at 17:30.

  11. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Paul W. (1st June 2014)

  12. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Comparison of the cabarc encoding scheme (len+dist combined) to the usual deflate scheme:

    1. It requires only one EC encode/decode operation per match and needs bookkeeping for only one table.
    2. It improves compression by exploiting correlations between length and distance.
    3. Since a practical cabarc scheme uses only 8*16 .. 16*32 len*dist codes, while deflate uses 32+32 (which may be further increased), deflate compresses each individual code more accurately. AFAIK the net effect is that the deflate scheme gives slightly better compression for text files, while cabarc wins on binaries.
    4. It has larger tables and therefore slower table construction, and it needs larger blocks to gather adequate statistics, or longer headers.

    Overall, we have plenty of choices - huf/fse, static/dynamic, cabarc/deflate, C++/asm/SSE - and lots of metrics to measure.
    Why not just implement a few coding modes instead of one and select the best one for each input block? Static blockwise-independent codes have the advantage that you don't pay a decoding penalty for having more than one coding mode - you just select a different decoding loop. For a fast decision about the best mode during encoding, you can look at the statistics of the previous block and also probe a small subblock of the current block. That's assuming you're doing anything other than greedy parsing. With greedy parsing you can split match finding and encoding blockwise and insert the coding-mode selection between them.
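
    One possible shape for that selection step (a sketch with made-up names - the estimate_cost function is assumed, not an existing API):

    Code:
    #include <stddef.h>
    #include <stdint.h>

    enum coding_mode { MODE_BYTE, MODE_HUF, MODE_FSE, MODE_COUNT };

    /* hypothetical estimator: predicted output size (in bits) of coding 'n' tokens
       with mode 'm', given their symbol counts; assumed to be implemented elsewhere */
    uint64_t estimate_cost(enum coding_mode m, const uint32_t counts[256], size_t n);

    /* pick a mode per block from the previous block's stats plus a small probe of
       the current block, then run the dedicated encode/decode loop for that mode   */
    static enum coding_mode pick_mode(const uint32_t prev_counts[256],  size_t prev_n,
                                      const uint32_t probe_counts[256], size_t probe_n)
    {
        enum coding_mode best = MODE_BYTE;
        uint64_t best_cost = UINT64_MAX;
        for (int m = 0; m < MODE_COUNT; m++) {
            uint64_t c = estimate_cost((enum coding_mode)m, prev_counts,  prev_n)
                       + estimate_cost((enum coding_mode)m, probe_counts, probe_n);
            if (c < best_cost) { best_cost = c; best = (enum coding_mode)m; }
        }
        return best;
    }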



    Also, some mode for record data should be useful IMO. LZMA has modelling of the lowest bits of the position (or something like that), which improves compression of e.g. the offset streams of preprocessed executables. Why not just restrict match offsets to multiples of 4 (or 8 ) and do the same for match lengths? That way you save 2 bits per offset and 2 bits per match length, and on record data that should only improve the compression ratio while also improving speed. Also, with an additional bit of preprocessing of truecolor bitmaps (extending byte triplets to quadruplets), one could apply the same trick to them and check whether that helps.
    Last edited by Piotr Tarsa; 1st June 2014 at 17:16.

  13. #9
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    As I remember it, ryg's ANS was dynamic, but the code is static. I haven't seen any code other than his and FSE, so it's likely that nobody has tried it and that it doesn't make sense.
    I'd like to make a benchmark testing how various schemes scale down to small buffers. inikeep's bench has a ton of linuxisms and doesn't compile here. TODO.

  14. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    I haven't studied any other ANS codecs, but at least FSE is offline - it builds the compression tables first. Well, the same is true for Huffman: you either perform two passes over the data, or use a semi-adaptive approach where old stats are used to encode new data. So the only real difference is that FSE's encoding and decoding directions differ. It may be that FSE builds its tables faster than Huffman does.
    FSE might be able to generate its table faster because it doesn't have to sort symbols by frequency or build a tree. (Cyan's explanation makes sense, too.)

    Matt's fpaqc is online and it implements a variant of ANS over a binary alphabet. It's literally interchangeable with AC. I don't fully understand rANS, yet, but I think it might be possible to do online, too.

    Quote Originally Posted by m^2 View Post
    As I remember it, ryg's ANS was dynamic, but the code is static. I haven't seen any code other than his and FSE, so it's likely that nobody has tried it and that it doesn't make sense.
    I'd like to make a benchmark testing how various schemes scale down to small buffers. inikeep's bench has a ton of linuxisms and doesn't compile here. TODO.

  15. #11
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    ANS faces the same problem as any other entropy coder, arithmetic included, regarding "online" compression (i.e. learning while compressing, without a precomputed table): its distribution must be recorded into a table, and as a result updating the table at each symbol is very expensive - except for... the binary version.

    rANS is no exception; it faces the same problem as any other multi-symbol arithmetic encoder.

  16. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Why not just implement a few coding modes instead of one and select the best one for each input block?
    We haven't implemented even one yet.

    Also, we are looking for an algorithm that spends around 10 CPU clocks per input byte, so there is hardly any time to perform any pre/post-processing.

  17. #13
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Also, some mode for record data should be useful IMO. LZMA has modelling of the lowest bits of the position (or something like that), which improves compression of e.g. the offset streams of preprocessed executables. Why not just restrict match offsets to multiples of 4 (or 8 ) and do the same for match lengths? That way you save 2 bits per offset and 2 bits per match length, and on record data that should only improve the compression ratio while also improving speed. Also, with an additional bit of preprocessing of truecolor bitmaps (extending byte triplets to quadruplets), one could apply the same trick to them and check whether that helps.
    My simple LZH does exactly this -

    For each block I send an "offset shift", which is 0, 1, or 2. Decoded offsets are then left-shifted by that number. It does indeed help compression on some binary files.
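
    A minimal sketch of that per-block offset shift (the names are mine, not from the actual LZH code):

    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* per-block "offset shift" in {0,1,2}: the encoder stores offset >> shift and the
       decoder left-shifts it back; only valid when every offset in the block is a
       multiple of 1 << shift                                                          */
    static uint32_t encode_offset(uint32_t offset, unsigned shift) { return offset >> shift; }
    static uint32_t decode_offset(uint32_t coded,  unsigned shift) { return coded  << shift; }

    /* choosing the shift once the block has been parsed: the largest s such that
       every offset is divisible by 1 << s                                             */
    static unsigned pick_shift(const uint32_t *offsets, size_t n)
    {
        uint32_t or_all = 0;
        for (size_t i = 0; i < n; i++) or_all |= offsets[i];
        if ((or_all & 3) == 0) return 2;
        if ((or_all & 1) == 0) return 1;
        return 0;
    }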

    The annoying thing is detecting when it should be done. It requires some kind of multi-pass parsing, which kind of sucks.

    If you're not going for maximum speed, then just entropy-coding the bottom bits is preferable (like LZX, LZMA, and my LZHLW).

    To address some of the other posts -

    There's no fast adaptive ANS that I know of at the moment. However, RANS has much faster table build times than Huffman. This means it could easily be used in a "semi-adaptive" way (aka "deferred summation"). That is, rebuild the table using the last N counts seen, and start with N=512, then 1024, etc.
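
    A sketch of that schedule (names are mine): keep running symbol counts and rebuild the table only at exponentially growing checkpoints, so the table lags the data slightly but the rebuild cost stays negligible.

    Code:
    #include <stdint.h>

    #define NSYM 256

    /* assumed to exist elsewhere: builds the rANS/Huffman coding table from the counts */
    void build_table(const uint32_t counts[NSYM]);

    static uint32_t counts[NSYM];
    static uint32_t seen         = 0;
    static uint32_t next_rebuild = 512;   /* rebuild after 512 symbols, then 1024, 2048, ... */

    static void update_stats(unsigned sym)
    {
        counts[sym]++;
        if (++seen >= next_rebuild) {
            build_table(counts);          /* the next chunk is coded with stats of everything seen so far */
            next_rebuild *= 2;
        }
    }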
    Last edited by cbloom; 3rd August 2016 at 20:42.

  18. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Since my analysis shows that the fast LZ+EC engine will spend a lot of time waiting for loads/branches, it may well be that Huffman with a combined dist/len slot is faster than this purpose actually requires, and that FSE and/or separate dist/len tables are preferable - especially taking into account the slowness of the Huffman table rebuild.

    I still insist that interleaving the match finder and the EC is important, since it makes the EC almost free. For FSE this means that we should encode in the forward direction and decode in the opposite one - and, of course, employ the semi-adaptive approach.
    Last edited by Bulat Ziganshin; 2nd June 2014 at 22:13.

  19. #15
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    If the Huffman table rebuild time is dominated by sorting, then you can try to speed it up by employing the AVX/AVX2 heapsorts I've developed.

    Is there some fast standalone 64-bit Huffman codec in which I can easily swap the sort procedure? Something that compiles under Clang 3.3.

    Update:
    When going to AVX2, then in addition to the SIMDified heapsort one can use vectorized Huffman encoding and decoding - AVX2 has a gather instruction that makes this sensible.

    Also, when it comes to sorting: it's probably not rare to have lots of symbols that occur very rarely and a few symbols that occur frequently. Thus it could be beneficial to first split the symbols into two sets - one set of frequently occurring symbols to be sorted with heapsort, and one set of rarely occurring symbols that can be sorted cheaply with counting sort.
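
    A scalar sketch of that split (the cutoff and the names are mine, and a plain insertion sort stands in for the SIMD heapsort):

    Code:
    #include <stdint.h>

    #define NSYM   256
    #define CUTOFF 64   /* "rare" = fewer than CUTOFF occurrences; the threshold is arbitrary here */

    /* sort symbol indices by ascending count: rare symbols via counting sort keyed on the
       exact count, the (few) frequent symbols via a comparison sort appended after them   */
    static void sort_symbols(const uint32_t count[NSYM], uint16_t order[NSYM])
    {
        uint16_t rare[NSYM], freq[NSYM];
        unsigned nrare = 0, nfreq = 0;
        unsigned bucket[CUTOFF] = {0};

        for (unsigned s = 0; s < NSYM; s++) {
            if (count[s] < CUTOFF) { bucket[count[s]]++; rare[nrare++] = (uint16_t)s; }
            else                   { freq[nfreq++] = (uint16_t)s; }
        }

        /* counting sort of the rare symbols */
        unsigned start[CUTOFF], pos = 0;
        for (unsigned c = 0; c < CUTOFF; c++) { start[c] = pos; pos += bucket[c]; }
        for (unsigned i = 0; i < nrare; i++) order[start[count[rare[i]]]++] = rare[i];

        /* insertion sort of the frequent symbols (stand-in for the SIMD heapsort) */
        for (unsigned i = 0; i < nfreq; i++) {
            unsigned j = pos + i;
            uint16_t s = freq[i];
            while (j > pos && count[order[j - 1]] > count[s]) { order[j] = order[j - 1]; j--; }
            order[j] = s;
        }
    }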
    Last edited by Piotr Tarsa; 3rd June 2014 at 01:01.

  20. #16
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by cbloom View Post
    There's no fast adaptive ANS that I know of at the moment. However, RANS has much faster table build times than Huffman. This means it could easily be used in a "semi-adaptive" way (aka "deferred summation"). That is, rebuild the table using the last N counts seen, and start with N=512, then 1024, etc.
    I wonder if the fundamental reason that ANS tables can be built faster is that we're tolerating them being done a little bit sloppy. The Huffman algorithm builds an optimal tree. For tANS, on the other hand, we still don't have an optimal table algorithm, and FSE's algorithm is very, very approximate. But ANS may have higher tolerances for error. Or, perhaps, we are building Huffman to unnecessary precision. (Obviously, if you build a perfect Huffman tree, and then substitute the canonical tree, some effort might have been wasted.)

  21. #17
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    nburns, the Huffman algorithm is about finding the best p_s \approx 2^{-k_s} approximation of the probability distribution.
    In contrast, tANS starts from a much better approximation: p_s \approx L_s/L. The cost of a sloppy symbol spread is that, for a single symbol, from some states (x) we use a few more bits to encode it and from some a few less (depending on local symbol fluctuations):
    \Delta H \approx \sum_{s,x} (q_{sx} - p_s)^2 / p_s
    This small state-dependent variation of q_{sx} around p_s \approx L_s/L causes some subtle rate penalty, but it is usually much smaller than the penalty of approximating the probability distribution.

    Quote Originally Posted by cbloom View Post
    However, RANS has much faster table build times than Huffman.
    Especially Fabian's alias variant - most updates of the probability distribution can be made very quickly, by just shifting the divider.
    Last edited by Jarek; 3rd June 2014 at 10:20.

  22. #18
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Jarek View Post
    nburns, the Huffman algorithm is about finding the best p_s \approx 2^{-k_s} approximation of the probability distribution.
    In contrast, tANS starts from a much better approximation: p_s \approx L_s/L. The cost of a sloppy symbol spread is that, for a single symbol, from some states (x) we use a few more bits to encode it and from some a few less (depending on local symbol fluctuations):
    \Delta H \approx \sum_{s,x} (q_{sx} - p_s)^2 / p_s
    This small state-dependent variation of q_{sx} around p_s \approx L_s/L causes some subtle rate penalty, but it is usually much smaller than the penalty of approximating the probability distribution.
    Hi Jarek.

    When I say that Huffman is optimal, I mean optimal in the sense that was proved in the original research: http://compression.ru/download/artic...ancy-codes.pdf

    What I'm getting at is that when you build a Huffman tree using the standard algorithm, there is exactly one unique tree/permutation of symbols that is the solution (assuming all the probabilities are unique). Typically, all that you really need are the code lengths, though, because that's all that affects the file size. So generating the exact unique Huffman tree is unnecessarily precise, and I'm speculating that the extra sorting effort and data movement might be what's dragging Huffman down. FSE doesn't attempt to find the unique optimal permutation of symbols when it builds the table - even though it must exist. FSE scatters the symbols using an O(n) algorithm that was chosen mainly because it would run fast. If FSE generated the optimal tANS table, the algorithm would probably not be simple and linear anymore, and there's an excellent chance that it would no longer be faster than Huffman.

    There was a thread about Huffman recently, and it got me thinking about how fast the Huffman initialization could be done. Even though it's not a very complicated algorithm to begin with, I think there's a decent chance that you could make it even simpler and faster by, e.g. only partitioning symbols into length classes and not building the whole tree.

  23. #19
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    There are Polar codes: http://www.ezcodesample.com/prefixer...r_article.html

    What I'm getting at is that when you build a Huffman tree using the standard algorithm, there is exactly one unique tree/permutation of symbols that is the solution (assuming all the probabilities are unique).
    Hmm, is there only one? I had the impression that sometimes more than one optimal Huffman tree is possible - I'm talking about the shape, e.g. a different number of symbols with a code of a particular length.

    I'm not sure how that would pan out, but assume you're building a Huffman tree and you're left with 3 subtrees to join, all of equal weight. You can then join the subtrees in any order. If they have different shapes, then you'll probably end up with a differently shaped final tree, if my intuition is right.

  24. #20
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Hi nburns,
    Indeed, Huffman is optimal - but only among quantizations to powers of 1/2, which is the big restriction of prefix codes.
    There are some approximations, like Polar trees, but they also need sorting.

    Regarding tANS, the FSE spread just visits each position of the decoding table twice - already using the better quantization p_s \approx L_s/L.
    However, in contrast to prefix codes, it doesn't work exactly with this q_s = L_s/L: depending on the current state, symbol s is sometimes stored using a bit more than lg(1/q_s) bits, sometimes a bit less.
    The fast but inaccurate FSE spread makes these fluctuations around lg(1/q_s) larger, but not by much, and that can be compensated by using a larger table.
    For now, finding the optimal symbol spread is a tough problem - it requires searching through an exponential number of possibilities.

  25. #21
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    There are Polar codes: http://www.ezcodesample.com/prefixer...r_article.html


    Hmm, is there only one? I had the impression that sometimes more than one optimal Huffman tree is possible - I'm talking about the shape, e.g. a different number of symbols with a code of a particular length.

    I'm not sure how that would pan out, but assume you're building a Huffman tree and you're left with 3 subtrees to join, all of equal weight. You can then join the subtrees in any order. If they have different shapes, then you'll probably end up with a differently shaped final tree, if my intuition is right.
    I think I get what you're saying. If you try it, I think you'll find that if the subtrees have identical weights, it's implied that they will also have identical shapes.
    Last edited by nburns; 3rd June 2014 at 18:36.

  26. #22
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    I think I get what you're saying. If you try it, I think you'll find that if the subtrees have identical weights, it's implied that they will also have identical shapes.
    There can be several optimal trees for optimal static huffman file compression.
    Example

    a = 2 (i.e. 1/3 of the symbols), and the rest make up 1/6 of the symbols each
    b = 1
    c = 1
    d = 1
    e = 1

    table from one optimal tree

    a 1   ** note: bit assignments that leave the tree the same don't count **
    b 010
    c 001
    d 011
    e 000

    table from a second also optimal huffman tree

    a 11
    b 101
    c 100
    d 01
    e 00


    using first table
    a a b c d e
    1 1 010 001 011 000 14 bits

    using second table
    a a b c d e
    11 11 101 100 01 00 14 bits
    Last edited by biject.bwts; 4th June 2014 at 01:35. Reason: fixed errors Tarsa saw

  27. The Following 2 Users Say Thank You to biject.bwts For This Useful Post:

    nburns (4th June 2014),Piotr Tarsa (4th June 2014)

  28. #23
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Good one, David. I didn't think it would be that simple. It's obvious that the last 3 subtrees to merge were of different shapes: two of them had 3 nodes each and one had a single node (counting leaves as nodes). Joining them in a different order produced a differently shaped final tree.

    Also, you made a slight mistake - a should have code 11, not 10, in the second example.
    Last edited by Piotr Tarsa; 4th June 2014 at 00:17.

  29. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    biject.bwts (4th June 2014)

  30. #24
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    There can be several optimal trees for optimal static huffman file compression.
    Example

    a = 2 or 1/3 if symbols a rest make up 1/6 of symbols each
    b = 1
    c = 1
    d = 1
    e = 1

    table from one optimal tree

    a 1 ** note bits assignments that leave the tree same don't count **
    b 010
    c 001
    d 011
    e 000

    table from a second also optimal huffman tree

    a 10
    b 101
    c 100
    d 01
    e 00


    using first table
    a a b c d e
    1 1 010 001 011 000 14 bits

    using second table
    a a b c d e
    10 10 101 100 01 00 14 bits
    Good counter-example. I'm trying to think about whether that invalidates any of the other points I made.

    My larger point was that Huffman gives an unnecessarily specific result, which costs time. Clearly, it isn't specific about how to break ties. But I was pointing out the information that's thrown away when generating a Huffman tree and using only the lengths. I'm not sure whether you can eliminate calculating all the node weights, for instance, but you might be able to eliminate the tree link pointers, which saves complexity and time.
    Last edited by nburns; 4th June 2014 at 00:46.

  31. #25
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Huffman tree has linear size anyway, so sorting dominates total time, asymptotically.

    One could do approximate sorting by, e.g., first invoking a selection algorithm to find the k-th most frequent element, then sorting the k most frequent elements and treating the rest as having equal weights.

    I'm still pretty sure that a proper variant of heapsort can be faster than quicksort on tables that fit in the L1 cache, since heapsort can greatly reduce the number of unpredicted jumps. But as heapsort means lots of cache misses on large tables, it's not used much in general.
    Last edited by Piotr Tarsa; 4th June 2014 at 00:55.

  32. #26
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I think the simplest example of different Huffman tree shapes is with 4 symbols when two have probability 1/3 and two have probability 1/6. For example, if a and b are 1/3, you could have

    a:0
    b:10
    c:110
    d:111

    or

    a:00
    b:01
    c:10
    d:11

    My personal experience is that missing the optimum code lengths by a bit for a few long codes has a small impact. Missing by a lot is much more significant, and code lengths for very recently used symbols should be a lot shorter than their Huffman codes.
    Last edited by Kennon Conrad; 4th June 2014 at 03:16.

  33. #27
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Huffman tree has linear size anyway, so sorting dominates total time, asymptotically.

    One could do approximate sorting by, e.g., first invoking a selection algorithm to find the k-th most frequent element, then sorting the k most frequent elements and treating the rest as having equal weights.

    I'm still pretty sure that a proper variant of heapsort can be faster than quicksort on tables that fit in the L1 cache, since heapsort can greatly reduce the number of unpredicted jumps. But as heapsort means lots of cache misses on large tables, it's not used much in general.
    You could quantise frequencies and do a bucket sort.


    BTW, static, predefined code tables are out of fashion recently, but they could be OK for a super-fast codec.

  34. #28
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Huffman tree has linear size anyway, so sorting dominates total time, asymptotically.
    True, but there's asymptotically-fast, and there's just plain fast. There's no simple relationship between the two.

    One could do approximate sorting by, e.g., first invoking a selection algorithm to find the k-th most frequent element, then sorting the k most frequent elements and treating the rest as having equal weights.
    I don't like approximate algorithms much. You'd need to put bounds on all the good and bad cases in order to be thorough in your analysis and be confident it's a good algorithm, which is a lot of thinking; but, at the end of the day, it's still only an approximate algorithm. It would have to just crush Huffman in speed, or else who would use it? I wouldn't.

  35. #29
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by m^2 View Post
    You could quantise frequencies and do a bucket sort.
    It doesn't seem like an obvious opportunity for bucket sort, because the frequencies tend to span several orders of magnitude.

    I think improving the worst-case asymptotic bound is impossible, because you can construct a situation where a full sort is needed: every symbol has a different tree depth except for one pair. That's at one extreme, though. At the other extreme, all symbols are at the same depth, and no sorting is needed. Catching the simple cases and stopping early is one possible approach.

  36. #30
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by m^2 View Post
    BTW, static, predefined code tables are out of fashion recently, but they could be OK for a super-fast codec.
    I've tested various coders with a 16 KB hash table. On the binary file 1g:
    Code:
    lz4:         54.73%,  time  2.560 secs,  speed 372.582 mb/sec
    tor -1 -c1:  58.94%,  time  4.105 secs,  speed 232.348 mb/sec
    tor -1 -c2:  52.69%,  time  4.566 secs,  speed 208.872 mb/sec
    tor -1 -c3:  45.31%,  time  6.279 secs,  speed 151.887 mb/sec
    tor -1 -c4:  45.01%,  time 11.913 secs,  speed  80.054 mb/sec
    And on enwik9:
    Code:
    lz4:         50.81%,  time  2.879 secs,  speed 331.213 mb/sec
    tor -1 -c1:  53.13%,  time  5.003 secs,  speed 190.620 mb/sec
    tor -1 -c2:  47.69%,  time  5.501 secs,  speed 173.356 mb/sec
    tor -1 -c3:  39.73%,  time  7.138 secs,  speed 133.608 mb/sec
    tor -1 -c4:  39.68%,  time 11.728 secs,  speed  81.314 mb/sec
    -c1..4 are "bytecoder", "bitcoder", "hufcoder" and "aricoder"
    Last edited by Bulat Ziganshin; 4th June 2014 at 22:33.

  37. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Cyan (5th June 2014)
