
Thread: MTF and DC on large alphabets via sorting

  1. #1
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts

    MTF and DC on large alphabets via sorting

    It's possible to compute move-to-front on arbitrarily large alphabets in O(n lg n) time by sorting a permutation. I've seen O(n lg m) algorithms proposed (alphabet size m), but they store the symbols in a tree. Trees can be difficult to implement efficiently*. For the BWT, MTF typically uses linear search over an array, which is extremely fast for byte alphabets, since the array is very small (just 4 cache lines on a current x86 CPU). Combined with the fact that MTF makes the searches shorter, an array is probably the fastest solution for the BWT. But for arbitrarily large alphabets, such as when the symbols are positions in the sequence, linear search approaches O(n^2) time and becomes impractical.

    MTF can be computed by creating a suitable permutation and finding the inversion vector. The inversion vector is a byproduct of sorting, and you can find it by adapting a stable sort algorithm. This technique also works for distance coding. I've attached an implementation that uses mergesort.

    The disadvantage of this approach is having to work with a permutation of n, which might increase the size of the data.

    See #10 for an explanation of how MTF and DC are related to inversions.
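    To make the idea concrete, here's a minimal stand-alone sketch of the inversion-vector-by-mergesort step in Python (my own simplified version, not the attached code; RINV[i] here means the count of j > i with S[j] < S[i]):

```python
def right_inversions(seq):
    """RINV[i] = number of j > i with seq[j] < seq[i].

    Computed as a byproduct of a stable mergesort over the indices,
    so the whole vector costs O(n log n) comparisons.
    """
    n = len(seq)
    rinv = [0] * n
    idx = list(range(n))          # sort indices, keeping the sort stable

    def sort(lo, hi):
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        sort(lo, mid)
        sort(mid, hi)
        merged = []
        i, j = lo, mid
        while i < mid and j < hi:
            if seq[idx[j]] < seq[idx[i]]:
                merged.append(idx[j])        # right element jumps left...
                j += 1
            else:
                rinv[idx[i]] += j - mid      # ...each such jump is an inversion
                merged.append(idx[i])
                i += 1
        while i < mid:
            rinv[idx[i]] += j - mid          # everything consumed from the
            merged.append(idx[i])            # right half was smaller
            i += 1
        merged.extend(idx[j:hi])
        idx[lo:hi] = merged

    sort(0, n)
    return rinv
```

A radix sort could be substituted here without changing the interface.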

    * Edited
    Attached Files
    Last edited by nburns; 4th June 2013 at 00:10.

  3. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Quote Originally Posted by nburns View Post
    It's possible to compute move-to-front on arbitrarily large alphabets in O(n lg n) time by sorting a permutation. I've seen O(n lg m) algorithms proposed (alphabet size m), but they store the symbols in a tree, which would be inefficient to search and update for any size alphabet.
    Inefficient because of what? Sure, a tree is much more complex than a simple array, but sorting is also complex compared to linear search. You could try splay trees, which can exploit the fact that the MTF distribution is usually very skewed.

  4. #3
    Quote Originally Posted by Piotr Tarsa View Post
    Inefficient because of what? Sure tree is much more complex than simple array, but sorting is also complex compared to linear search. You can try Splay Trees which can exploit the fact that the MTF distribution is usually very skewed.
    True. You can also sort using a tree. What I think is interesting is to show the relationship between MTF and inversions. An inversion vector is a way to capture a permutation without redundancy. I've shown how to recreate the permutation that generates the inversion vector that is equivalent to MTF. (Interestingly, the inversion vector that comes out is actually Move-to-Back. To get MTF, you have to subtract each element of the IV from the symbol count.) This gives a formal meaning to MTF in terms of permutation inversions, and suggests other ways to compute it. Given the equivalence of MTF to inversions, and inversions to sorting, computing MTF is, in effect, sorting.

    The IV algorithm can be separated out as a subroutine and optimized separately, since it has multiple applications. For instance, DC.

    The permutation that yields MTF is computed simply: Each element is the index of the previous occurrence of the same symbol. For DC, each element is the next occurrence of the same symbol. (As it happens, the "reverse IV" is more convenient than the IV. The IV counts the number of larger elements that precede the current element. The reverse IV counts the number of smaller elements that follow the current element. You can compute one from the other based only on the current element, in time O(1) per element.)
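    In Python, the link construction might look like this (a sketch with simple -1/n placeholder sentinels; the real code needs more careful boundary handling):

```python
def prev_links(seq):
    """link[i] = index of the previous occurrence of seq[i] (for MTF)."""
    last = {}
    out = []
    for i, c in enumerate(seq):
        out.append(last.get(c, -1))   # -1 as a simple first-occurrence sentinel
        last[c] = i
    return out

def next_links(seq):
    """link[i] = index of the next occurrence of seq[i] (for DC)."""
    n = len(seq)
    nxt = {}
    out = [0] * n
    for i in range(n - 1, -1, -1):    # scan right-to-left
        out[i] = nxt.get(seq[i], n)   # n as a simple last-occurrence sentinel
        nxt[seq[i]] = i
    return out
```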

    DC has a very direct relationship to the inversion vector. Given the IV, you get DC by eliminating most of the zeroes, and adding 1 to the leftover elements (to agree with convention). Then the last instance of each symbol is replaced with 0, and some unnecessary elements are trimmed from the end.
    Last edited by nburns; 27th April 2013 at 08:47.

  5. #4
    I've been using libopenbwt for testing MTF and DC. A utility is attached. (It expects libopenbwt to be installed.)

    Code:
    ~/src$ mtf_dc_openbwt -m BWT
    Writing to BWT.mtf
    ~/src$ byte_to_int32 BWT
    ~/src$ compute_mtf_dc -m BWT.int32
    Finding the largest symbol... 239
    Finding links...
    Count link backward inversions (BINV[i] = count(S[j] < S[i] where j > i)) for i in [0, len-1]...
    Subtracting...
    Writing to BWT.int32.mtf...
    ~/src$ byte_to_int32 BWT.mtf
    ~/src$ cmp BWT.mtf.int32 BWT.int32.mtf
    ~/src$
    ~/src$ mtf_dc_openbwt -d BWT
    Writing to BWT.dc
    ~/src$ compute_mtf_dc -d -s 256 BWT.int32
    Finding links...
    Count link backward inversions (BINV[i] = count(S[j] < S[i] where j > i)) for i in [0, len-1]...
    Compacting...
    Writing to BWT.int32.dc...
    ~/src$ cmp BWT.int32.dc BWT.dc
    ~/src$
    Attached Files
    Last edited by nburns; 27th April 2013 at 01:10.

  6. #5
    Concerning performance:

    My implementation of mergesort is probably not the fastest mergesort you could write. You could also substitute something like radix sort. I searched Google for inversion vector implementations others had written, but the only open-source one I found was very slow; I think it must have been O(n^2).

    For MTF on the BWT, it runs much faster if you eliminate the same-symbol runs before sorting. Of course, that would complicate things, because you'd have to remember where to add them back in later. For large alphabets, runs are much less likely, anyway.
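    The run elimination could be sketched like this (hypothetical helper, not part of the attachment; the saved lengths would let you reinsert the runs afterward):

```python
def collapse_runs(seq):
    """Collapse same-symbol runs before sorting; keep the run lengths
    so the runs can be restored after the MTF/inversion step."""
    syms, lens = [], []
    for c in seq:
        if syms and syms[-1] == c:
            lens[-1] += 1         # extend the current run
        else:
            syms.append(c)        # start a new run
            lens.append(1)
    return syms, lens
```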

  7. #6

    MTF inverter (plus minor updates)

    I've attached an update of the software that includes a tool for inverting MTF. The inverse algorithm and the forward algorithm are basically mirror images of each other.

    There are some other minor changes, most notably that I've modified the sort algorithm so that it uses insertion sort for short sequences. This slightly improves performance.
    Attached Files
    Last edited by nburns; 21st May 2013 at 23:18.

  8. #7
    Member
    Join Date
    Apr 2010
    Location
    CZ
    Posts
    81
    Thanks
    5
    Thanked 7 Times in 5 Posts
    Hi, could this be turned into QLFC? I think it's MTF, but stored a bit differently, so I think that should not be a problem.

    Also, I compared this to my version, but yours seems about 4 times faster, maybe because you process 32-bit integers. http://encode.ru/threads/546-Move-to...ll=1#post22075

    BTW on enwik8bwt I got a seg. fault.

  9. #8
    Quote Originally Posted by quadro View Post
    Hi, could this be turned into QLFC? I think it's MTF, but stored a bit differently, so I think that should not be a problem.
    I'm not familiar with qlfc. I'll take a look.

    Also, I compared this to my version, but yours seems about 4 times faster, maybe because you process 32-bit integers.
    That thread was part of what inspired me to work on this problem and post the result.

    I'm not sure how your algorithm works. Just now, I posted a description of how this one works. See a couple of posts below.

    BTW on enwik8bwt I got a seg. fault.
    This expects the input to be 32-bit integers. If the MTF sequence is stored as bytes, as it usually is when applied to the BWT, it's necessary to convert the bytes into 32-bit integers; otherwise it will typically segfault (and even if it doesn't, it will give nonsense results). There's a utility included to convert bytes into ints.

    It's not the fastest way to encode the BWT. Using mergesort is a way of guaranteeing O(n log n) performance irrespective of alphabet size. Since the BWT has a small and constant alphabet, you can tailor the algorithm to get better performance, and that's effectively what older BWT-specific MTF encoders have done (such as libopenbwt's). There's a correspondence between these tailored algorithms and sorting: libopenbwt's algorithm is similar to doing insertion sort or selection sort, plus taking advantage of the knowledge that the maximum inversion count for any element is bounded by a constant. That kind of approach does extremely well on the BWT, but it would incur the same O(n^2) worst-case performance of insertion sort and selection sort if you tried to apply it to the general case of integer sequences and unbounded alphabets.

    (Another thing that could cause this to segfault is if your integer sequence contains extremely large values. The reason is that I took a shortcut by assuming that the alphabet can be mapped onto an array in memory. A workaround would be to remap the values onto a smaller alphabet before running MTF (it's always possible to remap the alphabet so that the maximum value is <= the length of the sequence without causing collisions). I don't think this is the problem you're running into.)
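    The remap can be as simple as numbering values in order of first appearance (hypothetical helper, not in the attachment; note that remapping permutes the initial MTF list, which only affects first-occurrence codes):

```python
def remap_alphabet(seq):
    """Collision-free remap onto a dense alphabet [0, distinct_count),
    in first-appearance order, so the maximum value is < len(seq) and
    the alphabet fits an in-memory array."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in seq]
```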
    Last edited by nburns; 3rd June 2013 at 10:41.

  10. #9
    Here's a reference I found on qlfc and here is the abstract, with comments inserted:

    We propose a novel approach for the second step of the Burrows-Wheeler compression algorithm, based on the idea that the probabilities of events are not continuous-valued, but are rather quantized with respect to a specific class of base functions. The first pass of encoding transforms the input sequence x into sequence x̃. For each x_i = c, the local frequency x̃_i is the number of different symbols until the next occurrence of c (therefore, x̃_i is smaller than the effectively used alphabet size k = |A*|),
    The number of different symbols until the next occurrence of c is equivalent to MTF backwards (MTF gives the number of different symbols since the previous occurrence of c). You can obtain this conversion by reversing the input and applying any MTF algorithm, then reversing the result again so that it's once more in the original order. It's easy to see that the new sequence can be decoded, because each individual step can be reversed.
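    That conversion is easy to sketch in Python (textbook list-based MTF; the initial table order only matters for symbols with no further occurrence):

```python
def mtf(seq, table):
    """Textbook list-based MTF with the given initial symbol table."""
    table = list(table)
    out = []
    for c in seq:
        r = table.index(c)
        out.append(r)
        table.insert(0, table.pop(r))   # move the symbol to the front
    return out

def backward_mtf(seq, table):
    """Different symbols until the NEXT occurrence of each symbol:
    reverse, apply any MTF, reverse the result back."""
    return mtf(seq[::-1], table)[::-1]
```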

    the ending being handled by extending x with A* in any order. A* symbols are saved, in the order of their first occurrence, to be associated with the links of local frequencies. The second pass models and codes x̃. When encoding a local frequency value (from a link of local frequencies), the associated symbol is known to both the encoder and the decoder, as opposed to MTF, where rank values lose that information. The symbol context is an exponentially decaying average of its past local frequency values. For faster adaptation, A* is divided in a number of classes and x̃_i is encoded by entropy coding a) its class by order-1 context modeling, using its associated symbol context, and b) the remainder, if needed, by order-0 context modeling with one model
    It sounds like the remaining steps can be considered a post-processor for MTF (or, MTF backwards, as seems to be the case). The code that I've posted here could be useful for the initial step, if it uses a large alphabet. If the sequence x is just the output of the BWT, then it would be more efficient to use something else.
    Last edited by nburns; 3rd June 2013 at 10:43. Reason: Fixed font and color

  11. #10

    Explanation of the algorithms

    MTF:

    Convert the sequence of symbols to a sequence of integers, where each element of the sequence contains the index of the previous instance of the same symbol. This is like multiple intertwined linked lists, one list per alphabet symbol, which links together all instances of that alphabet symbol. An example follows (* marks a first occurrence):

    Code:
    index:  0 1 2 3 4 5 6 7 8 9
    symbol: A B C A A B A A B C
    link:   * * * 0 3 1 4 6 5 2

    Each element then describes an interval: the interval between consecutive instances of the same symbol. The inversion vector I used counts the number of "right" inversions, that is, the number of smaller elements to the right of the current element. For some interval X, the right inversions of X are the intervals that end to the right of X and start to the left of X, that is, the intervals that completely enclose it. These correspond to symbols that don't appear in X, and there can be at most one of these per alphabet symbol. So the inversion vector gives, for each element, the number of different symbols that haven't been seen since the last instance of that symbol. From this number, you can easily get the MTF value: subtract the inversion count, plus one for the current symbol, from the alphabet size.

    MTF[i] = Alphabet_size - Right_inversions[i] - 1
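    Putting the pieces together, here is one way to realize the recipe end-to-end in Python. The negative sentinels and the end-extension are my own choices for handling first and last occurrences, not necessarily what the attached code does, and the inversion counter is written naively for clarity (the point of the thread is that a stable mergesort computes it in O(n log n)):

```python
def mtf_by_sorting(seq):
    """MTF over the effectively used alphabet, via previous-occurrence
    links and right-inversion counts (a sketch; the boundary handling
    here is one possible choice, not necessarily the attachment's)."""
    alpha = sorted(set(seq))
    rank = {c: r for r, c in enumerate(alpha)}
    m = len(alpha)
    # Extend with one occurrence of each symbol so every link chain
    # reaches past the end of the real data.
    ext = list(seq) + alpha

    last = {}
    link = []
    for i, c in enumerate(ext):
        # first occurrences get negative sentinels ordered like the
        # initial MTF list, so they sort correctly against each other
        link.append(last.get(c, -(rank[c] + 1)))
        last[c] = i

    # naive O(n^2) right-inversion count, for clarity only
    rinv = [sum(x < link[i] for x in link[i + 1:]) for i in range(len(link))]
    return [m - rinv[i] - 1 for i in range(len(seq))]
```

On small inputs this matches a textbook list-based MTF initialized with the sorted used alphabet.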

    DC:

    Create an integer sequence just as in MTF, except that each element contains the index of the next instance of the current symbol. Find the inversion vector for that sequence, as with MTF: the one that counts right inversions. Now, for some interval X, the right inversions of X are the intervals that start to the right of X and end inside X, that is, the intervals that are completely enclosed by X. This number, the number of intervals enclosed by X, is the number of symbols in X whose next instance is also in X. This is essentially the DC value for X. It excludes the intervals that start in X and end outside of X, of which there is exactly one per different alphabet symbol in X, i.e. the MTF value (this suggests another way to compute MTF). To get the canonical DC sequence, you have to do some brushing up on the sequence. The first thing to notice is that there are lots of extra zeroes. Each same-symbol run will be filled with zeroes, and since DC encodes runs implicitly, these zeroes need to be omitted. For the remaining steps, please refer to my other posts in this thread and the source code.
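    The core of that computation, minus the cleanup steps, can be sketched like this (the past-the-end sentinels are a simplification of mine; canonical DC still needs the zero elimination and trimming described above):

```python
def dc_inversions(seq):
    """Right-inversion counts over next-occurrence links: rinv[i] is
    the number of symbols inside the gap (i, next(i)) whose next
    instance is also inside it -- essentially the DC value, before
    the zero-elimination and trimming cleanup."""
    n = len(seq)
    nxt = [0] * n
    seen = {}
    for i in range(n - 1, -1, -1):
        # distinct past-the-end sentinels keep last occurrences ordered
        nxt[i] = seen.get(seq[i], n + i)
        seen[seq[i]] = i
    # naive O(n^2) count; a stable mergesort does this in O(n log n)
    return [sum(nxt[j] < nxt[i] for j in range(i + 1, n)) for i in range(n)]
```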
    Last edited by nburns; 3rd June 2013 at 22:56.

  12. #11
    Hi, as for the seg fault on enwik8, I got this working after the byte-to-int conversion you included, thanks for the PM.

    I tried it now on enwik8; it was about 4 times slower than my MTF, but maybe just because the sort algorithm is not yet ideal.
    Last edited by quadro; 11th June 2013 at 13:43.

  13. #12
    Quote Originally Posted by quadro View Post
    Hi, as for the seg fault on enwik8, I got this working after the byte-to-int conversion you included, thanks for the PM.

    I tried it now on enwik8; it was about 4 times slower than my MTF, but maybe just because the sort algorithm is not yet ideal.
    I'm glad you got it working. Thanks for posting that.

    It's not hard to beat this implementation for the BWT on bytes. The implementation in openbwt is 6 times faster on enwik8.

    In order to see a benefit from this approach, the sequence must have a large alphabet.

  14. #13
    This also seems to be faster on random bytes. This algorithm has better worst-case behavior than the one in openbwt.

    This isn't to say that it couldn't be made faster. But I'm not sure it will ever be the right algorithm for the BWT. Especially when you consider that MTF is not likely to be the bottleneck when using the BWT.
    Last edited by nburns; 11th June 2013 at 19:37.
