
Thread: [LZ] Optimal parsing

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    [LZ] Optimal parsing

Recently I've added an optimal parser to Tornado. It took just 3 days for the basic algorithm, so I want to share the scenario with you.

So, I had an LZ compressor with greedy/lazy parsing and multiple entropy coders - byte/bit/huf/ari. In order to add optimal parsing I made the following changes:

1. I had a find_best_match() function that checked multiple matches and selected the best one for the greedy/lazy parser based on several heuristics. It was replaced with find_all_matches(), which just saves all found matches with increasing lengths. Actually, this was the last edit I made, and it increased the OP compression ratio by only 1-2% compared to using just the "best match".

2. The core of optimal parsing is match/literal pricing. It was quite easy to do for Tornado - essentially I just copied the match/literal encode functions and replaced the encoding actions with calculation of bit lengths; you may find this code in the lit_price() and match_price() functions, specific to every entropy coder. For arithmetic encoding, though, prices are precalculated whenever the encoding tables change, in order to make OP a bit faster.
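    For illustration, here is a minimal sketch (not Tornado's actual code) of what such price functions can look like for a Huffman-style coder whose code lengths are known; the tables and the distance-slot layout below are my own assumptions. For an arithmetic coder the same idea applies with price ≈ -log2(probability), precalculated whenever the model updates:
    Code:
    // Sketch only: mirror the encode functions, but count bits instead of emitting them.
    #include <cstdint>

    extern uint8_t huf_len[256];          // Huffman code length of each literal, in bits (assumed table)
    extern uint8_t match_len_bits[1024];  // bits needed to encode a match length (assumed table)

    static inline unsigned dist_price(uint32_t dist)
    {
        unsigned extra = 0;                        // number of extra bits ~ floor(log2(dist))
        while ((2u << extra) <= dist) extra++;
        return 6 + extra;                          // assumed 6-bit distance-slot code + extra bits
    }

    static inline unsigned lit_price(uint8_t byte)               { return huf_len[byte]; }
    static inline unsigned match_price(unsigned len, uint32_t d) { return match_len_bits[len] + dist_price(d); }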

3. Now we have all the ingredients for the optimal parser. I perform OP with forward evaluation of matches and backward best-path search. Input data are split into blocks of OPTIMAL_WINDOW (==3276 bytes), and every block is processed with this code:
    Code:
    start_block() 
    
    for (pos=0; pos<OPTIMAL_WINDOW; pos++)
      match_list = find_all_matches(pos)
      evaluate_literal_and_matches(match_list) 
    
    encode_block()
    Here, start_block() prepares the position price table:
    Code:
    price[0] = 0
    price[1..OPTIMAL_WINDOW] = UINT_MAX
    evaluate_literal_and_matches() checks whether the literal/matches starting at the current position can reduce the price of the positions they reach:
    Code:
    price[pos+1] ?= price[pos] + literal_price(data[pos])  
    // at success: len[pos+1]=1, dist[pos+1]=0
    
    for every match in match_list and every intermediate match_len
      price[pos+match_len] ?= price[pos] + match_price(match_len, match_dist)
      // at success: len[pos+match_len]=match_len, dist[pos+match_len]=dist
    where a?=b means a=min(a,b) and the "at success" code is executed only when the new price is assigned. This code assigns the minimum possible encoding price to every position in the optimum-search window, while saving the (len,dist) of the corresponding match/literal.

    encode_block() finds the best path to pos==OPTIMAL_WINDOW by following the trace of stored lengths:
    Code:
    for (pos = OPTIMAL_WINDOW; pos>0; )
      encode_match_or_literal (len[pos], dist[pos])
      pos -= len[pos]
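    Putting the three pieces together, here is a self-contained sketch of how the whole block loop could look (variable and function names are mine, not Tornado's; lit_price()/match_price(), find_all_matches() and the output routine are assumed to exist, and the constants are placeholder values). Note that the backward walk only collects the chosen steps; they are then emitted in forward order so the decoder sees them in stream order:
    Code:
    #include <cstdint>
    #include <climits>
    #include <vector>

    struct Match { unsigned len; uint32_t dist; };

    extern unsigned lit_price(uint8_t byte);
    extern unsigned match_price(unsigned len, uint32_t dist);
    extern std::vector<Match> find_all_matches(unsigned pos);
    extern void encode_match_or_literal(unsigned len, uint32_t dist);

    const unsigned OPTIMAL_WINDOW = 32768;   // length of the optimum-search block (assumed value)
    const unsigned MIN_MATCH      = 4;       // shortest codable match (assumed value)

    unsigned price[OPTIMAL_WINDOW + 1];      // cheapest known price to reach each position
    unsigned best_len[OPTIMAL_WINDOW + 1];   // length of the step that achieved it (1 = literal)
    uint32_t best_dist[OPTIMAL_WINDOW + 1];  // distance of that step (0 = literal)

    void optimal_parse_block(const uint8_t *data)
    {
        // start_block(): position 0 costs nothing, everything else is unreachable so far
        price[0] = 0;
        for (unsigned i = 1; i <= OPTIMAL_WINDOW; i++) price[i] = UINT_MAX;

        // forward evaluation of literals and matches
        for (unsigned pos = 0; pos < OPTIMAL_WINDOW; pos++) {
            if (price[pos] == UINT_MAX) continue;              // nothing reaches this position

            unsigned p = price[pos] + lit_price(data[pos]);    // try a literal
            if (p < price[pos+1]) { price[pos+1] = p; best_len[pos+1] = 1; best_dist[pos+1] = 0; }

            for (const Match &m : find_all_matches(pos))       // try every match, every length
                for (unsigned len = MIN_MATCH; len <= m.len && pos + len <= OPTIMAL_WINDOW; len++) {
                    unsigned q = price[pos] + match_price(len, m.dist);
                    if (q < price[pos+len]) { price[pos+len] = q; best_len[pos+len] = len; best_dist[pos+len] = m.dist; }
                }
        }

        // encode_block(): walk back from the end collecting steps, then emit them in order
        std::vector<unsigned> path;
        for (unsigned pos = OPTIMAL_WINDOW; pos > 0; pos -= best_len[pos]) path.push_back(pos);
        for (auto it = path.rbegin(); it != path.rend(); ++it)
            encode_match_or_literal(best_len[*it], best_dist[*it]);
    }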


    That's all you need to implement a correct optimal parser. Now, a few optimizations:

    1. The first version ran ~10x slower on binary data than on text files. The reason is simple - on highly repetitive data, such as long runs of zeroes, it exhibits O(n^2) time behaviour by checking, at every position, the price of every possible match up to the maximum length found. For example, processing a 45 KB run of zeroes performs ~10^9 price checks, which requires on the order of 1 second of CPU time.

    The solution is the so-called FAST BYTES parameter. Once a match longer than FAST_BYTES is found, it is accepted unconditionally (essentially implementing greedy parsing for all those bytes). This technique works extremely well, making the processing of binary files even faster than that of text files, while losing only about 0.1% of compression ratio (with FAST_BYTES=512). Moreover, reducing the FAST_BYTES parameter further improves compression speed with only a 1-2% loss of compression ratio, allowing us to build a broad range of compression profiles. Tornado uses FAST_BYTES=16/32/64/64/128/512 for the -11..-16 compression profiles, and FreeArc employs lzma with FAST_BYTES=16/32/128/273 for its various compression profiles.
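    For illustration, a sketch of how such a cutoff could be plugged into the evaluation loop of the sketch above (the FAST_BYTES value and the helper below are mine, not Tornado's code):
    Code:
    const unsigned FAST_BYTES = 128;   // e.g. 16..512 depending on the profile (assumed value)

    // When the longest match at 'pos' reaches FAST_BYTES, accept it unconditionally
    // and jump over the matched bytes instead of pricing every intermediate length.
    bool try_fast_bytes(unsigned pos, const Match &longest, unsigned &next_pos)
    {
        if (longest.len < FAST_BYTES || pos + longest.len > OPTIMAL_WINDOW)
            return false;                                   // fall back to normal pricing
        price[pos + longest.len]     = price[pos] + match_price(longest.len, longest.dist);
        best_len[pos + longest.len]  = longest.len;         // accepted without comparison
        best_dist[pos + longest.len] = longest.dist;
        next_pos = pos + longest.len;                       // greedy jump: skip pricing in between
        return true;
    }

    // usage inside the per-position loop:
    //     if (try_fast_bytes(pos, longest_match, next)) { pos = next - 1; continue; }  // loop's pos++ completes the jump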

    2. The second most important improvement is support for REPDIST codes. Tornado automatically supports REPDIST codes by checking for repeated distances in the LZ coder, but that's not enough. Adding REPDIST checks in the optimal parser itself improved binary file compression by as much as 3%! Tornado 0.6 implements this in the evaluate_literal_and_repdist() function. So, prior to checking (and even searching for) any matches, it retrieves the already-found optimal path to the current position and checks matches at the last 4 distances involved.

    Here I implemented one more optimization - every next repeated match is checked starting from the length+1 of the previous successful repeated match. Once a repeated match longer than FAST_BYTES is found, it is accepted unconditionally. Finally, regular matches are checked starting with the length+1 of the longest repeated match. This improves speed without much loss of compression. lzma implements a more careful strategy - it doesn't truncate repeated matches, but checks all regular matches starting with the len+1 of the first repeated match.
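    A rough sketch of what this repeated-distance evaluation could look like, building on the sketches above (the names are mine, not Tornado's actual code; match_len_at() and repmatch_price() are hypothetical helpers, and a repeated distance is assumed to be coded by its index 0..3, which is much cheaper than a full distance):
    Code:
    extern unsigned match_len_at(const uint8_t *data, unsigned pos, uint32_t dist);
    extern unsigned repmatch_price(unsigned len, unsigned rep_index);

    unsigned evaluate_literal_and_repdist(const uint8_t *data, unsigned pos)
    {
        // literal, as before
        unsigned p = price[pos] + lit_price(data[pos]);
        if (p < price[pos+1]) { price[pos+1] = p; best_len[pos+1] = 1; best_dist[pos+1] = 0; }

        // walk the already-found optimal path back to 'pos' and collect
        // the last (up to) 4 match distances used on it
        uint32_t rep_dist[4]; unsigned reps = 0;
        for (unsigned q = pos; q > 0 && reps < 4; q -= best_len[q])
            if (best_dist[q] != 0) rep_dist[reps++] = best_dist[q];       // skip literal steps

        unsigned start_len = MIN_MATCH;             // each next rep match is tried from the previous len+1
        for (unsigned r = 0; r < reps; r++) {
            unsigned len = match_len_at(data, pos, rep_dist[r]);
            if (len < start_len) continue;
            for (unsigned l = start_len; l <= len && pos + l <= OPTIMAL_WINDOW; l++) {
                unsigned c = price[pos] + repmatch_price(l, r);
                if (c < price[pos+l]) { price[pos+l] = c; best_len[pos+l] = l; best_dist[pos+l] = rep_dist[r]; }
            }
            if (len >= FAST_BYTES) return len;      // long rep match: stop the search here
            start_len = len + 1;
        }
        return start_len;    // regular matches are then checked starting from this length
    }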

    3. The rest is a collection of simpler tricks. A 5-20% speed improvement was gained by prefetching hash rows for the next file position prior to handling the current one - look for prefetch_hash(p+1). In lzma, I prefetch the hash for a few positions ahead, and even prefetch the binary tree cells.
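    For illustration, a tiny sketch of such a prefetch using the GCC/Clang intrinsic (hash_table and hash() are assumed to exist; this is not Tornado's actual code):
    Code:
    #include <cstdint>

    extern uint32_t *hash_table;
    extern unsigned  hash(const uint8_t *p);

    static inline void prefetch_hash(const uint8_t *p)
    {
        __builtin_prefetch(&hash_table[hash(p)]);   // start loading the hash row into cache
    }

    // in the per-position loop: prefetch the NEXT position's row while the
    // current one is still being searched, hiding part of the memory latency:
    //     prefetch_hash(data + pos + 1);
    //     match_list = find_all_matches(pos);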

    The OptimalParser class stores most of its data in the heap, but the prices[] table is allocated on the stack, improving speed by ~10%.




    To be continued...
    Last edited by Bulat Ziganshin; 10th March 2014 at 19:19.


  3. #2
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    That's all you need to implement a correct optimal parser.
    Is this a single pass process? Isn't match_price going to evolve based on match_len and match_dist frequencies? How can you know them beforehand?

  4. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    1. It may be implemented in multiple passes to improve compression, but lzma and tornado do it in a single pass.
    2. lzma/tornado use prices established prior to block processing. lzma has a maximum optimal block length kNumOpts==4 kb, but starts a new block every time a match with len>=fast_bytes is found.
    3. Overall it seems not very important, although of course the resulting compression isn't optimal in the mathematical sense.

  5. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I don't like the name. If you can improve it, it's not optimal, period.

  6. #5
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by m^2 View Post
    I don't like the name. If you can improve it, it's not optimal, period.
    As I've pointed out before, optimal depends on assumptions, so different assumptions lead to different targets for optimality. What's commonly called optimal parsing is not optimal in the global sense, i.e. in the sense that it can't be improved upon.

    It's true, though, that it seems like a contradiction to claim that a parsing is optimal when it's based on heuristics.

  7. #6
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    How about calling it optimalish parsing? Anybody who's worried about whether or in what technical sense it's "optimal" will then know to look into it.

  8. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    How about calling it optimalish parsing? Anybody who's worried about whether or in what technical sense it's "optimal" will then know to look into it.
    Well, nobody knows how to do it optimally in the can't-be-improved-upon sense. LZ has a very open-ended definition, such that that problem may never be solved. So you can substitute optimalish for all optimals in relation to LZ77 compressors.

  9. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    You can do optimal LZSS parsing and it is quite simple: http://cbloom.com/algs/dictionary.html#optimal
    With high probability, making an optimal parser for other LZ variations should be perfectly feasible as long as we constrain ourselves to fixed encodings. I think we can remove even that constraint, provided that we do entropy coding in fixed-size blocks and the encoding of a particular block does not depend on the contents of that block - only on the contents of previous ones.

    So e.g. the total solution for optimal parsing and semi-adaptive entropy coding would be to:
    - split the code stream into blocks and encode them separately using a block-based entropy coder (i.e. the same code lengths throughout the whole block),
    - make the entropy coding decoupled from the histogram of the currently coded block,
    - use some creative methods to generate probability distributions based on the contents (not the generated code streams) of a limited number of previous blocks and on predefined probability distributions,
    - when encoding, try every generated probability distribution (or only some, when in fast compression mode) for the current block in some order and then use the best one to code the symbol stream,
    - before outputting the code stream for the current block, output the serial number of the chosen probability distribution.

    Of course that's solving a problem by simplifying the problem, but nonetheless it could work and it's compatible with block based entropy coders.

    I think that making an optimal parser in "the can't-be-improved-upon sense" can be perfectly feasible for the above compression scheme.
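    For illustration only, a rough sketch (all names hypothetical, not taken from any existing coder) of the "try several candidate distributions per block and send the winner's serial number" step; the cost of a block under a fixed distribution is estimated as the sum of -log2(symbol probability):
    Code:
    #include <cstdint>
    #include <cmath>
    #include <vector>

    struct Distribution { std::vector<double> prob; };      // one candidate probability table

    // price in bits of coding a whole block's symbols with a given (fixed) distribution
    double block_cost(const std::vector<uint16_t> &symbols, const Distribution &d)
    {
        double bits = 0;
        for (uint16_t s : symbols) bits += -std::log2(d.prob[s]);
        return bits;
    }

    // pick the cheapest of the pre-generated candidates; its serial number is
    // written to the stream just before the block's code stream
    size_t choose_distribution(const std::vector<uint16_t> &symbols,
                               const std::vector<Distribution> &candidates)
    {
        size_t best = 0;
        for (size_t i = 1; i < candidates.size(); i++)
            if (block_cost(symbols, candidates[i]) < block_cost(symbols, candidates[best]))
                best = i;
        return best;
    }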
    Last edited by Piotr Tarsa; 17th March 2014 at 01:46.

  10. #9
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    However, is it better to be optimal in a less efficient model or suboptimal, but still good in a more efficient one? I suppose the latter wins.

  11. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Of course that's solving a problem by simplifying the problem, but nonetheless it could work and it's compatible with block based entropy coders.
    Or substituting a tractable problem for an intractable problem and solving the former? That's perfectly acceptable, of course.

  12. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Maybe "optimised" describes it as it demonstrates that the parsing has gone through rounds of optimisation, but much like the -O flag in a compiler it may not be the most "optimal" version, just better than not optimising at all.

  13. #12
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by JamesB View Post
    Maybe "optimised" describes it as it demonstrates that the parsing has gone through rounds of optimisation, but much like the -O flag in a compiler it may not be the most "optimal" version, just better than not optimising at all.
    The compiler analogy isn't a good one, because there's such a thing as unoptimized code, but there's no such thing as unoptimized LZ. To use LZ, you need to pick some matching strategy, whether it be heuristics-based, greedy, lazy, or some flavor of optimal. Optimal here is like optimal anywhere -- it optimizes some function that you get to choose, not necessarily giving the ultimate, never-to-be-improved optimum for the most general form of the problem. Nonetheless, it must come with a mathematical proof that it optimizes something.

  14. #13
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Ferragina et al. describe a different kind of optimal LZ here (http://arxiv.org/abs/0802.0835) that might be worth reimplementing, especially since they didn't seem to release source code. Their encoding uses the optimal number of bits under the assumption that the encodings of lengths and distances must come from a class of codes that includes Elias gamma and delta.

  15. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Property 1 (Increasing Cost Property). For any x, y ∈ [n] it is x ≤ y iff |f(x)| ≤ |f(y)|.

    The main open question is to extend our results to statistical encoding functions like
    Huffman or Arithmetic coders applied on the integral range [n] [22]. They do not necessarily
    satisfy Property 1 because it might be the case that |f(x)| > |f(y)|, whenever the integer y
    occurs more frequently than the integer x in the parsing of S. We argue that it is not trivial to
    design a bit-optimal compressor for these encoding-functions because their codeword lengths
    change as it changes the set of distances and lengths used in the parsing process.
    Last edited by Bulat Ziganshin; 19th March 2014 at 10:40.

  16. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Bulat,

    Basically, they're saying that optimizing the set of (distance, length) pairs is an unsolved problem in cases where the cost function for a given pair depends on the entire set, e.g. when the cost of encoding an integer depends on the frequency of occurrences of that integer. Those kinds of circular dependencies prevent you from using dynamic programming to get a polynomial-time solution, and there's a good chance that some or all of those problems will be NP-hard.

