Recently I added an optimal parser to Tornado. The basic algorithm took just 3 days, so I want to share the scenario with you.
So, I had an LZ compressor with greedy/lazy parsing and multiple entropy coders - byte/bit/huf/ari. In order to add optimal parsing, I made the following changes:
1. I had a find_best_match() function that checked multiple matches and selected the best one for the greedy/lazy parser based on several heuristics. It was replaced with find_all_matches(), which just saves all found matches with increasing lengths. Actually, this was the last edit I made, and it increased the OP compression ratio by only 1-2% compared to using just the "best match".
2. The core of optimal parsing is match/literal pricing. It was quite easy to do for Tornado - essentially, I just copied the match/literal encoding functions and replaced the encoding actions with calculation of bit lengths; you may find this code in the lit_price() and match_price() functions, specific to every entropy coder. For arithmetic encoding, though, prices are precalculated each time the encoding tables change, in order to make OP a bit faster.
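The precalculated-prices idea can be sketched as follows. This is only an illustration, not Tornado's actual lit_price()/match_price() code, and build_price_table() is a name of my own: for an arithmetic coder, a symbol with probability p costs about -log2(p) bits, so a fixed-point price table can be rebuilt once whenever the frequency tables change.

```python
import math

def build_price_table(freqs, scale=16):
    """Fixed-point prices (units of 1/2^scale bits) per symbol.

    Rebuilt once whenever the coder's frequency tables change, so the
    optimal parser can price symbols with a table lookup instead of
    recomputing logarithms at every position."""
    total = sum(freqs)
    return [round(-math.log2(f / total) * (1 << scale)) if f
            else 30 << scale          # ~infinite price for unseen symbols
            for f in freqs]

def lit_price(prices, byte):
    """Price of encoding one literal byte, as a table lookup."""
    return prices[byte]
```

With a uniform 256-symbol alphabet every literal costs exactly 8 bits; the more probable a symbol, the cheaper it becomes.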
3. Now we have all the ingredients for an optimal parser. I perform OP with forward evaluation of matches and backward best-path search. Input data are split into blocks of OPTIMAL_WINDOW (==3276 bytes), and every block is processed with this code:

    start_block()
    for (pos=0; pos<OPTIMAL_WINDOW; pos++)
        match_list = find_all_matches(pos)
        evaluate_literal_and_matches (pos, match_list)
    encode_block()

Here, start_block() prepares the position price table:

    price[0] = 0
    price[1..OPTIMAL_WINDOW] = UINT_MAX

evaluate_literal_and_matches() checks whether the literal/matches starting at the current position can reduce the price of the positions they reach:

    price[pos+1] ?= price[pos] + literal_price(data[pos])
    // at success: len[pos+1]=1, dist[pos+1]=0
    for every match in match_list and every intermediate match_len
        price[pos+match_len] ?= price[pos] + match_price(match_len, match_dist)
        // at success: len[pos+match_len]=match_len, dist[pos+match_len]=match_dist

where a?=b means a=min(a,b), and the "at success" code is executed only when a new price is assigned. This code assigns the minimum possible encoding price to every position in the optimum-search window, while saving the (len,dist) of the corresponding match/literal.
encode_block() finds the best path to pos==OPTIMAL_WINDOW by following the trace of lengths stored:

    for (pos = OPTIMAL_WINDOW; pos > 0; )
        encode_match_or_literal (len[pos], dist[pos])
        pos -= len[pos]

Note that this loop walks the block backward, so the matches/literals it visits have to be collected and then emitted in forward order.
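Putting the pieces together, the whole scheme above can be sketched as a toy, self-contained implementation. The match finder is naive and the prices are flat made-up constants, so this illustrates the algorithm rather than Tornado's actual code:

```python
UINT_MAX = float('inf')

def find_all_matches(data, pos, min_len=2, max_dist=1 << 16):
    """Naive search: return (length, dist) pairs with increasing lengths."""
    found, longest = [], min_len - 1
    start = max(0, pos - max_dist)
    for m in range(pos - 1, start - 1, -1):
        l = 0
        while pos + l < len(data) and data[m + l] == data[pos + l]:
            l += 1
        if l > longest:
            longest = l
            found.append((l, pos - m))
    return found

def lit_price(byte): return 9                      # flat toy prices
def match_price(length, dist): return 16 + dist.bit_length()

def optimal_parse(data):
    """Forward evaluation of matches, then backward best-path search."""
    n = len(data)
    price = [0] + [UINT_MAX] * n                   # start_block()
    len_, dist_ = [0] * (n + 1), [0] * (n + 1)
    for pos in range(n):
        # literal: price[pos+1] ?= price[pos] + lit_price(...)
        p = price[pos] + lit_price(data[pos])
        if p < price[pos + 1]:
            price[pos + 1] = p
            len_[pos + 1], dist_[pos + 1] = 1, 0
        # every match and every intermediate match_len
        for max_len, d in find_all_matches(data, pos):
            for l in range(2, max_len + 1):
                p = price[pos] + match_price(l, d)
                if p < price[pos + l]:
                    price[pos + l] = p
                    len_[pos + l], dist_[pos + l] = l, d
    # encode_block(): trace lengths backward, then reverse for emission
    path, pos = [], n
    while pos > 0:
        path.append((len_[pos], dist_[pos]))
        pos -= len_[pos]
    return list(reversed(path)), price[n]
```

The returned path is a forward-ordered list of (len, dist) pairs, with dist==0 marking a literal; summing the lengths always covers the whole block.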
That's all you need to implement a correct optimal parser. Now, a few optimizations:
1. The first version ran ~10x slower on binary data than on text files. The reason is simple - on highly repetitive data, such as long runs of zeroes, it exhibits O(n^2) time behaviour by checking, at every position, the price of every possible match up to the maximum length found. For example, processing a 45 kb run of zeroes performs ~10^9 price checks, which takes on the order of 1 second of CPU time.
The solution is the so-called FAST BYTES parameter. Once a match longer than FAST_BYTES is found, it is accepted unconditionally (essentially implementing greedy parsing for all those bytes). This technique works extremely well, making binary file processing even faster than processing of text files, while losing only 0.1% or so of compression ratio (with FAST_BYTES=512). Moreover, reducing the FAST_BYTES parameter further improves compression speed with only 1-2% loss of compression ratio, allowing us to build a broad range of compression profiles. Tornado uses FAST_BYTES=16/32/64/64/128/512 for the -11..-16 compression profiles, and FreeArc employs lzma with FAST_BYTES=16/32/128/273 for various compression profiles.
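The cut-off can be sketched like this (toy price function and illustrative names of my own; in the real parser this check sits inside the match-evaluation loop):

```python
def match_price(length, dist):        # toy stand-in price function
    return 16 + dist.bit_length()

def evaluate_with_fast_bytes(price, len_, dist_, pos, matches, fast_bytes=512):
    """Process the matches found at pos; return the next position to evaluate.

    Normally returns pos+1, but once a match of fast_bytes or longer is
    seen, it is accepted unconditionally (greedy parsing) and evaluation
    resumes after it -- avoiding O(n^2) behaviour on long runs."""
    for match_len, match_dist in matches:
        if match_len >= fast_bytes:
            p = price[pos] + match_price(match_len, match_dist)
            if p < price[pos + match_len]:
                price[pos + match_len] = p
                len_[pos + match_len] = match_len
                dist_[pos + match_len] = match_dist
            return pos + match_len             # skip all the covered bytes
        for l in range(2, match_len + 1):      # otherwise price every length
            p = price[pos] + match_price(l, match_dist)
            if p < price[pos + l]:
                price[pos + l] = p
                len_[pos + l], dist_[pos + l] = l, match_dist
    return pos + 1
```

On a long run of zeroes the long self-overlapping match trips the threshold at once, so the parser jumps over the run instead of pricing every intermediate length at every position.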
2. The second most important improvement is support for REPDIST codes. Tornado automatically supports REPDIST codes by checking for repeated distances in the LZ coder, but that's not enough. Adding REPDIST checks to the optimal parser itself improved binary file compression by as much as 3%! Tornado 0.6 implements this in the evaluate_literal_and_repdist() function. So, prior to checking (and even searching for) any matches, it retrieves the (already found) optimal path to the current position and checks matches at the last 4 distances involved.
Here, I implemented one more optimization - every next repeated match is checked starting from length+1 of the previous successful repeated match. Once a repeated match longer than FAST_BYTES is found, it is accepted unconditionally. Finally, the regular matches are checked starting with length+1 of the longest repeated match. This improves speed without much loss of compression. lzma implements a more careful strategy - it doesn't truncate repeated matches, but checks all regular matches starting with len+1 of the first repeated match.
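Those repeated-distance checks can be sketched as follows; the function names here are my own invention (the real code is evaluate_literal_and_repdist() in Tornado 0.6):

```python
def match_len_at(data, pos, dist, max_len):
    """Length of the match found `dist` bytes back from `pos`."""
    l = 0
    while l < max_len and data[pos + l] == data[pos - dist + l]:
        l += 1
    return l

def check_repdists(data, pos, last_dists, fast_bytes=512):
    """Try the last 4 distances on the optimal path to pos.

    Every next repeated match is checked starting from length+1 of the
    previous successful one; a match of fast_bytes or longer stops the
    search and is accepted unconditionally."""
    found, min_len = [], 2
    max_len = len(data) - pos
    for d in last_dists[:4]:
        if d > pos or d == 0:
            continue
        l = match_len_at(data, pos, d, max_len)
        if l >= min_len:
            found.append((l, d))
            if l >= fast_bytes:
                break              # FAST_BYTES early acceptance
            min_len = l + 1        # next rep-match must beat this one
    return found
```

Regular matches would then be searched only for lengths above the longest repeated match found, which is where the speed gain comes from.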
3. The rest is a collection of simpler tricks. A 5-20% speed improvement was gained by prefetching hash rows for the next file position prior to handling the current one - look for prefetch_hash(p+1). In lzma, I prefetch the hash for a few positions ahead, and even prefetch the binary tree cells.
The OptimalParser class stores most of its data on the heap, but the prices table is allocated on the stack, improving speed by ~10%.
To be continued...