
Thread: Improved MTF, WFC or qlfc post-processing

  1. #1 quadro (Member, CZ)

    Improved MTF, WFC or qlfc post-processing

    Hello, I have done several tests with MTF and qlfc post-processing. It seems that when the data are processed in a better way before the final entropy coding stage, the MTF / qlfc transform can actually improve overall compression (instead of making it worse, as these transforms sometimes do). And I think this could be used with DC or IF algorithms as well.

    I used a hybrid of the Elias Gamma code, with the modifications below, as the intermediate step, but I think it could serve as the last step as well, or as part of it, and I think it can be computed relatively fast.

    So the post-processing produces two streams (stream1, stream2) and looks as follows:

    1
    010
    011
    00100
    00101
    ..
    Well, so far this looks like a common Elias Gamma code : ) but notice that all prefixes, which consist of N zeros and a leading '1' character, will be encoded separately. The rest is an RLE-Bit encoding, with an O(1) transformation, or the modification I presented here http://encode.ru/threads/546-Move-to...ll=1#post22183

    Code:
    // Reordering table for character values 1..126: more frequent values map
    // to codes that produce fewer '1' bits after the RLE-Bit stage.
    int rle0Trans[] = new int[] {1,2,3,4,5,6,7,8,9,11,10,13,12,14,15,16,17,19,23,18,21,27,20,25,24,28,26,29,22,30,
                    31,32,33,35,39,47,34,37,43,55,36,41,51,40,49,48,38,45,59,42,53,57,44,50,56,52,54,60,58,46,61,62,
                    63,64,65,67,71,79,95,66,69,75,87,111,68,73,83,103,72,81,99,80,97,96,70,77,91,119,74,85,107,76,89,115,82,101,88,
                    113,104,100,98,112,84,105,123,120,114,102,78,121,116,106,86,117,90,93,108,109,92,125,124,122,118,110,94,126};
    
    // rleBitChar is assumed to be >= 1 here; values above 126 pass through unchanged
    if (rleBitChar <= 126) rleBitChar = rle0Trans[(int) rleBitChar - 1];
    // ... now do RLE-Bit encode
    Applying this transform before RLE-Bit means the RLE-Bit stream should contain fewer '1' symbols for the more common input characters.
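
    As an illustration of the two-stream split, here is a minimal sketch (my own code and names, not the author's): the Elias gamma prefix goes to stream1, stored as a length, and the mantissa bits go to stream2, which is what the RLE-Bit stage then consumes.

    Code:
    import java.util.List;

    // Hedged sketch of the two-stream gamma split; assumes v >= 1.
    static void gammaSplit(int v, List<Integer> stream1, List<Integer> stream2) {
        int n = 31 - Integer.numberOfLeadingZeros(v); // mantissa bit count; prefix = n zeros + '1'
        stream1.add(n);                               // stream1: the prefix, stored as its length
        for (int i = n - 1; i >= 0; i--)              // stream2: the n mantissa bits, MSB first
            stream2.add((v >> i) & 1);
    }
    For example, v = 5 yields prefix length 2 in stream1 (the code prefix '001') and mantissa bits 0, 1 in stream2, matching the code 00101 in the listing above.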

    Some information on RLE-Bit:
    http://www.juergen-abel.info/Preprin...BWT_Stages.pdf

    Now the compression workflow could look like the following:

    1. First, encode RLE on the plain BWT output, covering the runs of identical characters, in basically the same way as all remaining characters in step 3. Note that we need to encode both the runs of identical characters AND their positions, as distances relative to the remaining characters (only in this step). This results in four separate streams (run lengths, their distances, and then both the code prefixes and the transformed RLE-Bit codes above), which I encoded with 4 separate EC models. I got about 6,358,486 from this RLE on enwik8bwt (RLE including distances).
    2. The second step is the MTF, WFC or qlfc transform on all remaining characters. These no longer contain runs of identical adjacent characters, which means the resulting alphabet will be one character smaller than the input one (a minimal sketch of this step follows the list).
    3. Process all MTF / qlfc output characters with the post-processing code above, but note that for the first character (the most common one) we can code only the prefix; no RLE-Bit stage is needed. So now we create two streams for all characters, prefixes and transformed RLE-Bit, resulting in a fully binary alphabet.
    4. The last step is binary entropy coding (a binary RC or arithmetic coding).
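
    To make step 2 concrete, here is a minimal MTF sketch (my own illustration, not the author's code) over the run-collapsed stream; it shows where the one-character-smaller alphabet comes from.

    Code:
    // MTF over input whose runs of identical adjacent characters were removed.
    static int[] mtfNoRuns(byte[] in) {
        int[] recency = new int[256];
        for (int i = 0; i < 256; i++) recency[i] = i;       // index 0 = most recently seen
        int[] out = new int[in.length];
        for (int p = 0; p < in.length; p++) {
            int c = in[p] & 0xFF, rank = 0;
            while (recency[rank] != c) rank++;              // current rank of c
            System.arraycopy(recency, 0, recency, 1, rank); // move c to the front
            recency[0] = c;
            // adjacent characters always differ here, so rank 0 cannot occur
            // (except possibly for the very first symbol); shifting ranks down
            // by one is what shrinks the output alphabet by one character
            out[p] = rank - 1;
        }
        return out;
    }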

    I think this workflow could be a bit different, e.g. for DC or IF, but I haven't tested those so far.
    Also many of these steps can be made in parallel.

    Results: I got about 20,056,570 on enwik8bwt with qlfc and a binary RC, and this post-processing brought roughly a 1 MB improvement.
    Anyway, I don't have a working encoder or demo yet; I used a bunch of independent scripts and in-place code modifications. So the question is how this will behave with data formats other than text; some changes might be needed.
    Last edited by quadro; 3rd June 2013 at 19:55.

  2. #2 quadro (Member, CZ)
    It seems stream1 (the code prefixes) is compressed far better as a whole number; e.g. for o1 MTF or qlfc it is typically 0 .. 7.

    My RC is not optimized at this stage, but the other coders I tried show this weird property too.

    Compression went from 20,056,570 -> 19,791,620, without RC optimizations (stream1 uses a byte model, stream2 a bit model).

    I will try to release a demo when I get it to work.
    Last edited by quadro; 4th June 2013 at 02:19.

  3. #3 quadro (Member, CZ)
    OK, it seems the preceding workflow might not be ideal, so I changed it a bit. Now I am saving stream1 in byte form, together with the RLE-Bit stream (runs of adjacent characters), to preserve the original contexts: 0 & 1 represent runs of adjacent characters (in RLE-Bit form), 2 and 3 are MTF/qlfc output characters which represent only themselves, and the other characters (4+) are stream1 in byte form, now representing the number of bits minus 2; the remaining bits are then coded accordingly in separate contexts.

    I tried a few different transformations, because it seems the one above may help with RLE but is not ideal for coding the MTF/qlfc output, which has some additional context and is much less random. The first one (trans1) seems to fit MTF/qlfc much better (characters are coded according to their number of bits in separate contexts, e.g. 2..5 (2 bits), 6..13 (3 bits), 14..29 (4 bits), etc.):

    int trans1[] = {0,1,2,3,4,6,5,7,8,10,12,14,9,11,13,15}
    int trans2[] = {0,1,2,3,4,6,5,7,8,12,10,14,9,13,11,15}

    Now I am working on improving my RC. Some results for a quasi-static range coder, which currently codes whole bytes and does nothing but halve the probabilities now and then: enwik8bwt 23,827,091; + qlfc 22,913,462; + qlfc + the changes above 21,316,056 (with no pre-processing, alphabet reordering or BWT reflections yet).
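
    The probability halving mentioned here is presumably the usual quasi-static rescale; a minimal sketch of that kind of model update (my own illustration, with my names):

    Code:
    // Quasi-static byte model: count symbols, and when the total reaches a
    // limit, halve all counts so older statistics fade and totals stay bounded.
    static int update(int[] count, int total, int sym, int limit) {
        count[sym]++; total++;
        if (total >= limit) {
            total = 0;
            for (int i = 0; i < count.length; i++) {
                count[i] = (count[i] + 1) >> 1;   // round up so no count drops to 0
                total += count[i];
            }
        }
        return total;                             // caller keeps the running total
    }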


    Update: trans3 seems to give 21,168,032 on the same setup with the quasi-static RC above.

    int trans3[] = {0,1,2,3,6,7,4,5,8,10,12,14,15,9,11,13}
    Let's see how those intervals 2..5 (.), 6..13 (x), 14..29 (i), etc. are selected in these transforms. I think the improvement comes from the fact that an RC gives better results on probabilities close to 0 or 1 than on 0.5, since in MTF/qlfc smaller ranks are usually more probable. Anyway, I would still like to know whether there is a general way to compute such a transform in place.
    Code:
    //                  . . x x . . x x  x  i  i  x x  x
    int trans3[] = {0,1,2,3,6,7,4,5,8,10,12,14,15,9,11,13}
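
    As I read those intervals, a value v >= 2 lands in the bucket 2^k - 2 .. 2^(k+1) - 3 and carries k mantissa bits coded in a context selected by k. A sketch of that mapping (my code and names, assuming this reading is correct):

    Code:
    static int bucketBits(int v) {                    // 2..5 -> 2, 6..13 -> 3, 14..29 -> 4, ...
        return 31 - Integer.numberOfLeadingZeros(v + 2);
    }

    static int bucketOffset(int v) {                  // the k-bit offset inside the bucket
        return v + 2 - (1 << bucketBits(v));
    }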
    Last edited by quadro; 11th June 2013 at 13:53.

  4. #4 quadro (Member, CZ)
    It seems the probability assumption above might be right.

    So I created another transform, also for the interval 0..1 (still only for MTF/qlfc characters 0..15); now the characters 2 & 3 are also coded like the rest of the MTF/qlfc characters (2 -> 0 (the first MTF/qlfc character aside from the RLE, which coded 0 & 1) -> trans4 -> 0 (coded as 0) in the first bit interval (coded as 2); 3 -> 1 -> trans4 -> 2 (coded as 0) in the second bit interval (coded as 3)).

    int trans4[] = {0,2,1,3,6,7,4,5,8,10,12,14,15,9,11,13}
    Code:
    // interval     0 . 0 . x x . . x x  x  i  i  x x  x ...
    int trans4[] = {0,2,1,3,6,7,4,5,8,10,12,14,15,9,11,13}
    This results in 20,810,330 with qlfc and trans4 followed by the quasi-static, unoptimized byte RC. I think an optimized, adaptive, bit-oriented RC (8 calls / char) could give better results.
    Last edited by quadro; 11th June 2013 at 16:38.

  5. #5 nburns (Member, San Diego)
    This isn't related to QLFC, but since we're talking about the BWT, here's an idea I had. I don't know if it's been tried before.

    On text, at least, the BWT alternates between smooth areas with little or no entropy and choppy areas with high entropy. The smooth areas tend to be the insides of words, and the choppy areas are around the edges of words, or the places inside words where word roots intersect prefixes and suffixes, etc. If you choose a run in the BWT and follow it forward or backward to see what came before and after it in the text, there's a good chance that you'll stay inside runs for a few letters, until you hit the edge of the word. So seeing a run of letter E somewhere in the BWT predicts that there are likely to be two additional runs somewhere else in the BWT, one that corresponds to what's directly before that run (in the order of the text), and one that corresponds to what's directly after that run.

    When reading back the BWT, assuming you know the letter counts ahead of time, you can fully predict the first (sorted) column. As you go through the BWT column, then, you should have enough information to be able to make some predictions about the parts you haven't seen yet. For instance, if you see a run of Es between the 10th and 35th E in the BWT column, coinciding with the 50th-75th letter A in the first column, you can make two predictions: there's likely to be a run (of something) near the 10th-35th E in the first column, and there's likely to be a run (of As) near the 50th-75th A in the BWT column. All you have to do to know when you get there is to count letters.
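
    The first (sorted) column really is fully determined by the symbol counts; a tiny sketch of that reconstruction (my own illustration):

    Code:
    // Rebuild the first BWT matrix column from per-symbol counts alone.
    static byte[] firstColumn(int[] counts) {         // counts[sym] = occurrences of sym
        int total = 0;
        for (int c : counts) total += c;
        byte[] first = new byte[total];
        int pos = 0;
        for (int sym = 0; sym < counts.length; sym++) // symbols in sorted order
            for (int i = 0; i < counts[sym]; i++)
                first[pos++] = (byte) sym;
        return first;
    }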

  6. #6 quadro (Member, CZ)
    Quote Originally Posted by nburns View Post
    So seeing a run of letter E somewhere in the BWT predicts that there are likely to be two additional runs somewhere else in the BWT, one that corresponds to what's directly before that run (in the order of the text), and one that corresponds to what's directly after that run.
    Interesting, isn't this somehow related to how the BWT sorting algorithm works? I think e.g. BWT sorting reflections are designed to move similar parts together, but maybe it could be used somehow to improve RLE. Also I think the M03 algorithm makes some assumptions based on character counts from the first columns and uses such statistics to improve compression.


    Meanwhile, I've done a few tests on automatically generating the transformations I used previously, for better alignment.

    I made a few theoretical calculations and compared them with real data and compression results. For the test I used qlfc ranks 1..6 from enwik8bwt (ignoring zero ranks) with the following counts (1 to 6): 11,079,447, 5,430,027, 3,606,683, 2,681,309, 2,094,126, 1,700,935, to calculate intervals of variable lengths: 0; 1 & 5; 2 & 4; 3 & 3; or 1 & 3 & 2 (Huffman tree).

    Method       | Interval  | Theoret. best (log2 P) division cost | Transform generated | Real compression
    Direct EC    | 0         | -           | no              | 6,851,373
    All Comb.    | 1 & 5     | 7.57855e+07 | 1 / 2 3 4 5 6   | 6,823,816
    All Comb.    | 2 & 4     | 1.10952e+08 | 1 3 / 2 4 5 6   | 6,829,766
    All Comb.    | 3 & 3     | 1.10241e+08 | 1 5 6 / 2 3 4   | 6,831,409
    Huffman tree | 1 & 3 & 2 | -           | 1 / 2 3 4 / 5 6 | 6,831,262

    Based on these results, I think Huffman does pretty well, and the theoretically best transformation roughly matches the real compression results (except for the case where the symbols are divided in half (the 3 & 3 interval); anyway it's only a small gap, maybe caused by the sharper difference in probabilities compared to the previous 2 & 4 case?). Using all combinations, these transforms can be generated within a few seconds for up to 14 symbols or so, but beyond that it would take too long (a brute-force sketch follows below). The compression gains are not that big, but maybe that's because the interval was quite small. The actual gain, I think, comes from the difference between frequent characters and less frequent ones, which was missing here.

    For a larger alphabet I think Huffman can be used if we smooth down the number of bits needed for all symbols, like here where we shrink 3 intervals into 2; Huffman matches the theoretical best exactly (in this test). So I'd like to test it on a larger alphabet.
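
    The exhaustive search over two-group divisions can be sketched like this (my code; the cost function below, group-flag entropy plus fixed-length in-group codes, is a placeholder and not necessarily the exact metric behind the table above; it assumes all frequencies are positive):

    Code:
    // Try every split of n symbols into two non-empty groups; keep the cheapest.
    // With n up to ~14 the 2^n subsets are enumerable in seconds, as noted above.
    static long bestTwoGroupSplit(long[] freq) {
        int n = freq.length;
        double best = Double.MAX_VALUE;
        long bestMask = 0;
        for (long mask = 1; mask < (1L << n) - 1; mask++) {
            double fa = 0, fb = 0;
            int ca = 0, cb = 0;
            for (int i = 0; i < n; i++)
                if (((mask >> i) & 1) != 0) { fa += freq[i]; ca++; }
                else                        { fb += freq[i]; cb++; }
            double t = fa + fb;
            double cost = -(fa * log2(fa / t) + fb * log2(fb / t)) // coding the group flag
                        + fa * ceilLog2(ca) + fb * ceilLog2(cb);   // coding symbols inside groups
            if (cost < best) { best = cost; bestMask = mask; }
        }
        return bestMask;                    // bit i set => symbol i belongs to the first group
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }
    static double ceilLog2(int x) { return x <= 1 ? 0 : Math.ceil(log2(x)); }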
    Last edited by quadro; 14th June 2013 at 16:14.

  7. #7 nburns (Member, San Diego)
    Quote Originally Posted by quadro View Post
    Interesting, isn't this somehow related to how the BWT sorting algorithm works? I think e.g. BWT sorting reflections are designed to move similar parts together, but maybe it could be used somehow to improve RLE. Also I think the M03 algorithm makes some assumptions based on character counts from the first columns and uses such statistics to improve compression.
    By reflections, do you mean like Gray codes?

    This is different. The lexicographic order groups points in the data by similar trailing context, but it does nothing to group points with similar positions in the original sequence. Changing the symbol order is still dealing only with context.

    Imagine a text that has the word "context" repeated 4 times. At the bottom of this post I filled out parts of what the BWT might look like (with bits of the 1st column filled in). I assumed that some of the symbols near the end of the word got broken up, and I didn't even bother filling in the final Ts, because their positions would depend entirely on what came after, which could be anything. But let's assume the "cont" symbols formed solid runs.

    It's plain to see that the runs are correlated, even though the BWT spreads them far apart -- they came from the same word. It would be nice to take advantage somehow, but BWT algorithms typically don't. They try to guess things about the current local context, but different contexts are treated as if they have nothing to do with each other.

    But finding one from the other is really easy. Going from top to bottom, let's say you hit the run of 4 Os, which come behind Ns. You can make a guess that there's another run behind the same Os, and you know exactly when you'll hit those Os in column 1. All you have to do is count Os. So when you get to those Os, the model increases the probability of the next four symbols being the same.

    Symmetrically, based on the same run of Os, you can make a guess that the four Ns that are in front of them will be part of a run. From counting Ns, you know that you haven't hit those Ns yet in the BWT. Lower down, at some point you'll hit the first of those Ns, and the model will then increase the probability that the next three symbols will also be Ns. In fact, there's even more useful information available at that point. You know that the Ts in the first column were almost all contiguous when you saw them higher up in the BWT column. So that makes a run even more likely.

    ?c
    ?c
    ?c
    ?c
    .
    .
    te
    .
    te
    te
    te
    .
    .
    on
    on
    on
    on
    .
    .
    .
    co
    co
    co
    co
    .
    .
    nt
    nt
    nt
    nt
    .
    .
    .
    ex
    .
    ex
    ex
    .
    ex
    Last edited by nburns; 15th June 2013 at 01:38.

  8. #8 nburns (Member, San Diego)
    Here's a little utility that might be interesting. I wrote it to play with Huffman, and I just added some more statistics. The last two columns in the output show, respectively, the number of wasted bits for each symbol, and a running total of wasted bits for all symbols. The waste is relative to the theoretical optimum. It's interesting to notice how on some symbols you win big, but then on other symbols you lose even bigger. Of course, when you've added up all the symbols, the best you can do is break even.
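
    As I understand the description, the two waste columns boil down to the following computation (a sketch, not the utility's actual source):

    Code:
    // Per-symbol waste = bits Huffman actually spends minus the theoretical
    // minimum; negative where Huffman wins on a symbol, positive where it loses.
    static void printWaste(long[] count, int[] codeLen) {
        long total = 0;
        for (long c : count) total += c;
        double running = 0;
        for (int s = 0; s < count.length; s++) {
            if (count[s] == 0) continue;
            double optimal = -Math.log((double) count[s] / total) / Math.log(2); // -lg p
            double waste = count[s] * (codeLen[s] - optimal);
            running += waste;                                 // the running-total column
            System.out.printf("%4d %10d %3d %14.0f %14.0f%n",
                              s, count[s], codeLen[s], waste, running);
        }
    }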

    Edit: Oops. The utility wouldn't compile. There's a new version, and it includes a Windows binary compiled with Cygwin (DLL included).
    Last edited by nburns; 17th June 2013 at 19:44.

  9. #9 nburns (Member, San Diego)
    I simplified and reworked the huffman utility. The new version is attached. There's no Windows binary for now.

    [Attached image: huffman_stats.png (61.7 KB), the utility's Huffman statistics output]

    The image shows the output for the top five symbols for enwik8 and the Bible after BWT, MTF, and Huffman. You can see that when only using MTF and Huffman, by far the majority of the waste comes from zero.


  11. #10 quadro (Member, CZ)
    Thanks for the utility. I might try some of these things when I get to some more dynamic modeling.


    I think the waste made by Huffman might decrease if we then compress the actual Huffman code node states, not the characters themselves.
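
    A sketch of how I read that idea (entirely my own illustration; Node and BinaryRC are hypothetical types, not anything from this thread): keep the Huffman tree shape, but give each internal node its own adaptive binary probability, so a range coder can recover the fractional-bit redundancy.

    Code:
    // Hypothetical sketch: code the branch decisions ("node states") of a fixed
    // Huffman tree with a per-node adaptive context instead of emitting raw bits.
    static void encodeSymbol(Node n, int symbol, BinaryRC rc) {
        while (!n.isLeaf()) {
            int bit = n.branchToward(symbol);  // 0 = symbol is in the left subtree
            rc.encodeBit(bit, n.prob);         // n.prob: adaptive model owned by this node
            n.prob.update(bit);
            n = (bit == 0) ? n.left : n.right;
        }
    }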

  12. #11 quadro (Member, CZ)
    I briefly tried the method above and got quite impressive results. I compared Huffman to the cyclic trees I was using to divide symbols into sub-intervals, and it seems Huffman actually wins on highly frequent symbols (versus the cyclic tree), but over the whole frequency distribution the cyclic tree may beat Huffman. I think this implies the need to divide the frequency distribution into several sub-intervals when it is unsettled (like MTF's), or to compute various trees for each frequency interval and pick the best, lowest-cost one.

    Tree type        | Frequent symbols (1..8+) | Whole freq. distribution | Tree structure used
    Huffman          | 95,062,279               | 137,509,894              | Huffman on high freq. symb. or all
    Cyclic tree      | 107,337,291              | 162,367,167              | (1/2) / ( (4 / (7/8+) ) / ( 3 / (5/6) ) )
    Bin. cyclic tree | 109,983,187              | 158,343,796              | ( (1/2) / (3/4) ) / ( (6/7) / (5/8+) )
    (The table shows the number of bits produced by each tree on enwik8 qlfc, ignoring zero counts, with the final EC stage missing; 8+ is an escape symbol.)


    So I think the downside is that the final EC stage will probably have to reset its model when it switches between intervals with different tree structures and statistics, but that should not be too costly when it is not done frequently.
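
    For reference, the "number of bits produced" by a fixed code tree, as tabulated above, is just frequency times code depth summed over the symbols (a trivial sketch, my code):

    Code:
    static long treeCost(long[] freq, int[] depth) {   // depth[s] = code length of symbol s
        long bits = 0;
        for (int s = 0; s < freq.length; s++)
            bits += freq[s] * depth[s];                // every occurrence costs its code length
        return bits;
    }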
    Last edited by quadro; 26th June 2013 at 08:53. Reason: Corrected Huffman result see below


  14. #12 Member (Kraków, Poland)
    What's the size of a fully decodable, self-contained compressed file?

  15. #13 quadro (Member, CZ)
    It seems I skewed the data for Huffman; I've just corrected the table. Anyway, it seems the overall number of bits produced in this step is not that important, because the Huffman node states are already pretty well compressed overall; their probabilities are close to 0.5, which may not be true for other trees.
    But I think it may still lead to compression to follow the state changes from the root, separating them. Even better, I think, may be deriving a fitting tree from descending Huffman code lengths, ignoring escape symbols, etc.

    I will try to give some more data, but I have to do more tests.
    Last edited by quadro; 26th June 2013 at 09:11.

  16. #14 quadro (Member, CZ)
    Update: I think I finally got Huffman working as it should; sorry for the somewhat messed-up stats.

    I tried encoding characters 1 - 207 from enwik8qlfc and got 15,507,349 (no RLE, coded from the root to the leaves).

    Then I tried reversing all codes and coding upwards, i.e. from the bottom to the root, and got 14,645,465.

    Then I tried switching the characters 0x0C and 0x0D and got 14,604,287, and 14,592,930 from the two character swaps 0x0C, 0x0D & 0x11, 0x12. So it seems that for better results the Huffman codes will have to be sorted somehow...

    Code:
     
     0:   0 0
     1:   0 1 0
     2:   0 1 1
     3:   0 1 1 0
     4:   1 0 0 1
     5:   1 1 0 1
     6:   0 1 1 1 0
     7:   0 0 0 0 1
     8:   0 0 1 0 1
     9:   0 0 1 1 1
     10:  0 1 1 1 1
     11:  1 1 1 1 1 0
     12:  1 1 0 0 0 1
     13:  0 1 0 1 1 1
     14:  0 1 1 1 1 1
     15:  0 0 1 1 1 1 0
     16:  1 0 1 0 0 0 1
     17:  1 0 1 0 1 0 1
     18:  0 1 1 0 1 1 1
     19:  1 1 1 1 1 1 1

    But for now it still loses to a cyclic, escaped tree like ( (10+ ) / ( 1 / (2/3) ) ) / ( (4/5) / ( (6/7) / (8/9) ) ), where I put the escape 10+ at the top, because it seems to preserve context better, and it can be used as RLE only as well, because the zeros are context-dependent.

    Let's see whether sorting or some other transform helps. I think I'll finally release a test app soon, with both Huffman and the escaped tree for comparison, but I will have to tune it up first.
    Last edited by quadro; 26th June 2013 at 13:32.

  17. #15 nburns (Member, San Diego)
    Quote Originally Posted by quadro View Post
    Then I tried switching the characters 0x0C and 0x0D and got 14,604,287, and 14,592,930 from the two character swaps 0x0C, 0x0D & 0x11, 0x12. So it seems that for better results the Huffman codes will have to be sorted somehow...
    Sorted by what?

    I'd try to chip in with more feedback if I had a better understanding of what you're talking about. I'm not sure what you meant by compressing Huffman node states instead of characters. And are you feeding the output of Huffman to a range coder, or are you using Huffman in place of the range coder? TBH, I haven't experimented much with that sort of thing (modeling, arithmetic coding, range coding). I've stuck mainly with algorithms that are simple enough to implement from scratch, and with working on various algorithmic problems that are somehow related to compression (like MTF). I haven't taken a shot at implementing an entire high-end compressor, with all of the complex modeling and tricks and tweaks that go with that. I may eventually get there, after I've worked my way through all the lower-level stuff. But it seems like there's already a lot of competition there. And my training was comparatively light in ML and AI. (I've gravitated more toward discrete math and computer systems-type stuff.)

    Wavelet trees may be worth a try instead of Huffman. I'm not sure exactly how they'd fit in to your workflow, but they can get compression that's at least comparable to Huffman. And they have some ability to adapt to local conditions. They're not hard to implement -- I implemented them more than once for experimentation a little while ago. They can actually be used in combination with Huffman (Huffman-shaped trees), but the biggest contribution to compression comes from doing RLE on the bits that come out. The tree shape -- somewhat surprisingly -- doesn't seem to make a big difference.
    Last edited by nburns; 27th June 2013 at 02:37.

  18. #16 quadro (Member, CZ)
    Well, I am looking for a structure which would fit the actual frequency distribution, or percentage share; for example, the first four symbols could make up 90% of the file. So I'd like to use Huffman to help me make such a division (e.g. according to equal or similar code lengths). I think it may be useful to create groups based e.g. on code lengths and use binary search or the like, which may even fit the probability distribution slightly but should still remain compressible.

    Huffman groups symbols into overall balanced nodes; two symbols which make up 60% and 0.1% of the file are unlikely to end up within a single or nearby node, but once they are balanced the way Huffman does it, they are unlikely to compress further.

    It seems there may be rules which actually help internal grouping at equal code lengths, so now I am experimenting with these: e.g. when the number of symbols within a group is even, avoiding needless combinations; and the grouping direction seems to matter.
    Last edited by quadro; 27th June 2013 at 11:47.

  19. #17 nburns (Member, San Diego)
    I'm still not sure I follow you all that well. What Huffman is for, as I see it, is to take fragments of information that information theory says should take up fractional numbers of bits and pack them into codes that consume whole numbers of bits. Huffman is guaranteed to find the optimal packing. Any time you have a symbol that makes up 60% of a file, Huffman will assign it a 1-bit code. -lg(0.6)=0.74, which means it should get about 0.74 bits, but the smallest possible code is one bit.

  20. #18 quadro (Member, CZ)
    Every bit forms a separate context, so there needs to be a whole number of bits to encode the Huffman codes in this fashion. There is some redundancy; anyway, compressing the Huffman codes once more yields some small gains. There can also be a small gain at the lower nodes from character reordering.

    I divided the whole stream into something like an exponent (or index) and a mantissa.

    The mantissa is uncompressed, and I use something like binary search over it. Each new index represents a group of symbols which have the same Huffman code lengths.

    And I just added a new Huffman tree, which is built from the overall probability of every group and acts as the actual index or pointer to every group. I think I will stick with this for now; it looks quite promising.
    Last edited by quadro; 27th June 2013 at 19:59.

  21. #19 nburns (Member, San Diego)
    Quote Originally Posted by quadro View Post
    Every bit forms a separate context, so there needs to be a whole number of bits to encode the Huffman codes in this fashion. There is some redundancy; anyway, compressing the Huffman codes once more yields some small gains. There can also be a small gain at the lower nodes from character reordering.
    It sounds like you could be on the way to reinventing the Huffman-shaped wavelet tree.

    The wavelet tree isolates each bit -- in this case, Huffman tree node -- and enumerates it separately. If these are the counts and Huffman codes:

    4: 0
    2: 10
    1: 110
    1: 111

    And this is the Huffman-coded file:

    0 110 10 0 10 111 0 0

    The wavelet tree looks like this:

    01101100 1001 01

    The first group is the first bit from each coded symbol. The next group is all the second bits (from the ones that have at least two bits). The last group is all the third bits. The other thing is that in each group, the bits are divided into subgroups based on the bits that led up to it, and the subgroups are in sorted order. I didn't make the example complex enough to show this.

    It's like doing radix sort while saving a snapshot of the bit you're about to sort on, right before each round of sorting. It's also a similar idea to the BWT.
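
    A sketch that reproduces the example above (my code, not from this thread): collect the i-th bit of every codeword into group i, in file order. The prefix-based subgrouping is not shown, because this example happens not to exercise it.

    Code:
    import java.util.ArrayList;
    import java.util.List;

    static List<StringBuilder> waveletGroups(String[] codedFile) {
        List<StringBuilder> groups = new ArrayList<>();
        for (String code : codedFile)                 // {"0","110","10","0","10","111","0","0"}
            for (int i = 0; i < code.length(); i++) {
                while (groups.size() <= i) groups.add(new StringBuilder());
                groups.get(i).append(code.charAt(i)); // group i collects all i-th bits
            }
        return groups;                                // -> [01101100, 1001, 01]
    }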


  22. #20 quadro (Member, CZ)
    I might try the wavelet tree, but so far it seems any prefix code I tried is often beaten by an escaped version by a small margin, even if the escaped codes do not have Huffman lengths and do not fit the file so well.

    For example, 1 (5+), 00 (0), 010 (1), 0110 (2), 01110 (3), 01111 (4) simply does better than a prefix code built from whole-file Huffman code lengths.

    I think Huffman code lengths from the whole file just do not provide enough detail, which might come in handy. So I decided to use multiple escaped Huffman trees for a roughly constant character count, and it seems to perform a bit better than the previous cyclic code.

    For example, groups of 8, 9 and 10 characters etc. plus an escape symbol. But I have not fixed the character count, because it seems better to encode groups of 1 & 2 characters as a single group of 3, and a group does not always correspond to the Huffman code lengths, because it might be more convenient to divide a single group of 6 characters into 2 & 4, or 7 into 3 & 4, and so on, so characters from a lower group might be assigned to the upper one.

    Symbol | 9+ escape | 10+ escape | 8+ escape | 9+ escape
    0      | 1 0       | 1 1 0      | 1 1 0     | 1 0 0 1
    1      | 1 0 0     | 0 0 1      | 1 0 0 0   | 0 1 0 1
    2      | 1 0 1     | 0 1 1      | 1 1 0 0   | 1 1 0 1
    3      | 1 0 0 0   | 1 1 1      | 0 0 1 0   | 0 0 1 1
    4      | 1 0 0 1   | 1 0 1 0    | 1 0 1 0   | 1 0 1 1
    5      | 0 0 0 0 0 | 0 1 0 1    | 0 0 0 0 0 | 0 1 1 1
    6      | 1 0 0 0 0 | 1 1 0 1    | 1 0 0 0 0 | 1 1 1 1
    7      | 0 0 0 0 1 | 0 0 0 1 0  | 0 0 1 0 0 | 0 0 0 0 1
    8      | 1 0 0 0 1 | 1 0 0 1 0  | 1 0 1 0 0 | 1 0 0 0 1

    Each column represents the Huffman-tree code lengths made for 9 consecutive symbols, with an escape symbol, whose cost of 1 bit is added at the top. The Huffman codes are actually not encoded, but they help divide the symbols into groups. And the symbols in every group should be assigned lower bit counts where possible.
    Last edited by quadro; 28th June 2013 at 12:36.

  23. #21 nburns (Member, San Diego)
    There are lots of steps in your algorithm, and I'm a little lost. I gather that you're taking the output of qlfc, minus the runs, and separately encoding the exponents. Then you're doing something similar for the run lengths. Or, are you reincorporating the run lengths with the qlfc symbols?

    Without getting too detailed, qlfc seems to be a way of taking MTF and adjusting the ranks to get a better distribution. Is that right? So the output has the same length and range of values as MTF?

    Which part are you using Huffman for? The exponents of the qlfc symbols?

    I'm not sure what you're doing in the last post with consecutive symbols and escapes. What do the symbols represent? Exponents from qlfc? Exponents of run lengths? Are the consecutive symbols runs of equal symbols, or are you factoring out patterns that appear more than once? I don't think they could be equal-symbol runs, because I think you're separating runs before qlfc and treating them as integers. The run lengths themselves, or the logs of the run lengths, could have short runs in them. But those runs are very unpredictable in my experience.

    Please correct whatever I've misstated. It sounds like you're still hashing out a lot of the details. It would help if you could restate the concept and the steps from the beginning, so I could see what parts have changed or not.

  24. #22 quadro (Member, CZ)
    Yes, basically it does. I think both the RLE and the coding of all the other characters can be viewed as a kind of RLE, or frequency encoding with binary output in this case; if we treat every character like runs of zeros, or vice versa, there is no big difference.

    Separating out the consecutive characters can give some advantages, like memory and time, but it's a matter of preference, like RLE-bit versus RLE-exp.

  25. #23 quadro (Member, CZ)
    Here are some preliminary results:

    File      | Transform | Compressed size
    enwik8bwt | QLFC      | 20,970,792
    enwik8bwt | MTF       | 22,590,363
    enwik9bwt | QLFC      | 170,407,607
    enwik9bwt | MTF       | 184,052,747

    There is a quasi-static RC followed by a Huff-length-optimized, pre-computed tree (built from enwik8bwt, down to a depth of 36 characters), then repeatedly or recursively the same.

    There are 4 static RLE stages, tuned a bit for qlfc (not MTF); anyway, I'd like to add a more dynamic RLE in the future to boost it a bit.


