# Thread: MTF and DC on large alphabets via sorting

1. ## MTF and DC on large alphabets via sorting

It's possible to compute move-to-front on arbitrarily large alphabets in O(n lg n) time by sorting a permutation. I've seen O(n lg m) algorithms proposed (alphabet size m), but they store the symbols in a tree, and trees can be difficult to implement efficiently. For the BWT, MTF typically uses linear search over an array, which is extremely fast for byte alphabets, since the array is very small (just 4 cache lines on a current x86 CPU). Combined with the fact that MTF makes the searches shorter, an array is probably the fastest solution for the BWT. For alphabets that are arbitrarily large, such as when the symbols are positions in the sequence, this approaches O(n^2) and becomes impractical.
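For concreteness, the array-with-linear-search scheme can be sketched like this (a minimal Python sketch of the textbook technique, not code from this thread; names are mine):

```python
def mtf_encode(data, alphabet_size=256):
    """Classic MTF with linear search over a small symbol table.

    O(n * m) worst case (m = alphabet size), but fast in practice on
    BWT output, because MTF keeps recently seen symbols near the front
    and the whole table fits in a few cache lines.
    """
    table = list(range(alphabet_size))
    out = []
    for sym in data:
        rank = table.index(sym)   # linear search
        out.append(rank)
        table.pop(rank)           # move the symbol to the front
        table.insert(0, sym)
    return out
```

On byte alphabets the `table.index` scan rarely goes far, which is why this beats asymptotically better structures on BWT output.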

MTF can be computed by creating a suitable permutation and finding the inversion vector. The inversion vector is a byproduct of sorting, and you can find it by adapting a stable sort algorithm. This technique also works for distance coding. I've attached an implementation that uses mergesort.
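As a concrete sketch of that idea (my own minimal Python, not the attached implementation): a stable mergesort can be instrumented so that, as a byproduct, it counts for each element the number of smaller elements to its right.

```python
def right_inversions(seq):
    """inv[i] = number of positions j > i with seq[j] < seq[i],
    computed in O(n log n) by piggybacking on a stable mergesort."""
    inv = [0] * len(seq)

    def merge_sort(ix):           # ix: indices into seq, merged stably
        if len(ix) <= 1:
            return ix
        mid = len(ix) // 2
        left, right = merge_sort(ix[:mid]), merge_sort(ix[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if seq[right[j]] < seq[left[i]]:
                merged.append(right[j]); j += 1
            else:
                inv[left[i]] += j   # the j right-half elements already taken were smaller
                merged.append(left[i]); i += 1
        for k in range(i, len(left)):
            inv[left[k]] += j       # all j taken right-half elements were smaller
            merged.append(left[k])
        merged.extend(right[j:])
        return merged

    merge_sort(list(range(len(seq))))
    return inv
```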

The disadvantage of this approach is having to work with a permutation of n, which might increase the size of the data.

See #10 for an explanation of how MTF and DC are related to inversions.


3. > It's possible to compute move-to-front on arbitrarily large alphabets in O(n lg n) time by sorting a permutation. I've seen O(n lg m) algorithms proposed (alphabet size m), but they store the symbols in a tree, which would be inefficient to search and update for any size alphabet.

Inefficient because of what? Sure, a tree is much more complex than a simple array, but sorting is also complex compared to linear search. You could try splay trees, which can exploit the fact that the MTF distribution is usually very skewed.

4. > Originally Posted by Piotr Tarsa
> Inefficient because of what? Sure, a tree is much more complex than a simple array, but sorting is also complex compared to linear search. You could try splay trees, which can exploit the fact that the MTF distribution is usually very skewed.
True. You can also sort using a tree. What I think is interesting is to show the relationship between MTF and inversions. An inversion vector is a way to capture a permutation without redundancy. I've shown how to construct the permutation whose inversion vector is equivalent to MTF. (Interestingly, the inversion vector that comes out is actually move-to-back; to get MTF, you have to subtract each element of the IV from the symbol count.) This gives a formal meaning to MTF in terms of permutation inversions, and suggests other ways to compute it. Given the equivalence of MTF to inversions, and of inversions to sorting, computing MTF is, in effect, sorting.

The IV algorithm can be separated out as a subroutine and optimized separately, since it has multiple applications. For instance, DC.

The permutation that yields MTF is computed simply: Each element is the index of the previous occurrence of the same symbol. For DC, each element is the next occurrence of the same symbol. (As it happens, the "reverse IV" is more convenient than the IV. The IV counts the number of larger elements that precede the current element. The reverse IV counts the number of smaller elements that follow the current element. You can compute one from the other based only on the current element, in time O(1) per element.)
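The two link constructions can be sketched as follows (my own Python; I use -1 as the placeholder for a first or last occurrence, which the real code would handle according to its chosen convention):

```python
def prev_links(seq):
    """MTF permutation: each element is the index of the previous
    occurrence of the same symbol (-1 for a first occurrence)."""
    last, out = {}, []
    for i, sym in enumerate(seq):
        out.append(last.get(sym, -1))
        last[sym] = i
    return out

def next_links(seq):
    """DC permutation: each element is the index of the next
    occurrence of the same symbol (-1 for a last occurrence)."""
    n = len(seq)
    # previous-occurrence links of the reversed sequence, mapped back
    rev = prev_links(seq[::-1])[::-1]
    return [n - 1 - v if v != -1 else -1 for v in rev]
```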

DC has a very direct relationship to the inversion vector. Given the IV, you get DC by eliminating most of the zeroes and adding 1 to the leftover elements (to agree with convention). Then the last instance of each symbol is replaced with 0, and some unnecessary elements are trimmed from the end.

5. I've been using libopenbwt for testing MTF and DC. A utility is attached. (It expects libopenbwt to be installed.)

Code:
```
~/src$ mtf_dc_openbwt -m BWT
Writing to BWT.mtf
~/src$ byte_to_int32 BWT
~/src$ compute_mtf_dc -m BWT.int32
Finding the largest symbol... 239
Finding links...
Count link backward inversions (BINV[i] = count(S[j] < S[i] where j > i)) for i in [0, len-1]...
Subtracting...
Writing to BWT.int32.mtf...
~/src$ byte_to_int32 BWT.mtf
~/src$ cmp BWT.mtf.int32 BWT.int32.mtf
~/src$
~/src$ mtf_dc_openbwt -d BWT
Writing to BWT.dc
~/src$ compute_mtf_dc -d -s 256 BWT.int32
Finding links...
Count link backward inversions (BINV[i] = count(S[j] < S[i] where j > i)) for i in [0, len-1]...
Compacting...
Writing to BWT.int32.dc...
~/src$ cmp BWT.int32.dc BWT.dc
~/src$
```

6. Concerning performance:

My implementation of mergesort is probably not the fastest mergesort you could write. You could also substitute something like radix sort. I searched Google for inversion vector implementations others had written, but the only open source one I found was very slow; I think it must have been O(n^2).

For MTF on the BWT, it runs much faster if you eliminate the same-symbol runs before sorting. Of course, that would complicate things, because you'd have to remember where to add them back in later. For large alphabets, runs are much less likely anyway.

7. ## MTF inverter (plus minor updates)

I've attached an update of the software that includes a tool for inverting MTF. The inverse algorithm and the forward algorithm are basically mirror images of each other.

There are some other minor changes, most notably that I've modified the sort algorithm so that it uses insertion sort for short sequences. This slightly improves performance.

8. Hi, could this be turned into qlfc? I think it's an MTF, but stored a bit differently, so I think that should not be a problem.

Also I compared this to my version, but yours seems about 4 times faster, maybe because you process 32-bit integers. http://encode.ru/threads/546-Move-to...ll=1#post22075

BTW on enwik8bwt I got a seg fault.

9. > Originally Posted by quadro
> Hi, could this be turned into qlfc? I think it's an MTF, but stored a bit differently, so I think that should not be a problem.
I'm not familiar with qlfc. I'll take a look.

> Also I compared this to my version, but yours seems about 4 times faster, maybe because you process 32-bit integers.
That thread was part of what inspired me to work on this problem and post the result.

I'm not sure how your algorithm works. Just now, I posted a description of how this one works. See a couple of posts below.

> BTW on enwik8bwt I got a seg fault.
This expects the input to be 32-bit integers. If the MTF sequence is stored as bytes, as it usually is when applied to the BWT, it's necessary to convert the bytes into 32-bit integers, otherwise it will typically seg fault (and even if it doesn't, it will give nonsense results). There's a utility included to convert bytes into ints.

It's not the fastest way to encode the BWT. Using mergesort is a way of guaranteeing O(n log n) performance irrespective of alphabet size. Since the BWT has a small and constant alphabet, you can tailor the algorithm to get better performance, and that's effectively what older BWT-specific MTF encoders have done (such as libopenbwt's). There's a correspondence between these tailored algorithms and sorting: libopenbwt's algorithm is similar to doing insertion sort or selection sort, plus taking advantage of the knowledge that the maximum inversion count for any element is bounded by a constant. That kind of approach does extremely well on the BWT, but it would incur the same O(n^2) worst-case performance of insertion sort and selection sort if you tried to apply it to the general case of integer sequences and unbounded alphabets.

(Another thing that could cause this to segfault is if your integer sequence contains extremely large values. The reason is that I took a shortcut by assuming that the alphabet can be mapped onto an array in memory. A workaround would be to remap the values onto a smaller alphabet before running MTF (it's always possible to remap the alphabet so that the maximum value is <= the length of the sequence without causing collisions). I don't think this is the problem you're running into.)

10. Here's a reference I found on qlfc, and here is the abstract, with comments inserted:

> We propose a novel approach for the second step of the Burrows-Wheeler compression algorithm, based on the idea that the probabilities of events are not continuous-valued, but are rather quantized with respect to a specific class of base functions. The first pass of encoding transforms the input sequence x into a sequence x̃. For each xi = c, the local frequency x̃i is the number of different symbols until the next occurrence of c (therefore, x̃i is smaller than the effectively used alphabet size k = |A*|),
The number of different symbols until the next occurrence of c is equivalent to MTF backwards (MTF gives the number of different symbols since the previous occurrence of c). You can obtain this by reversing the input, applying any MTF algorithm, and then reversing the result again so that it's once more in the original order. It's easy to see that the new sequence can be decoded, because each individual step can be reversed.
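The double-reversal trick can be sketched like this (a minimal Python sketch; `mtf_forward` is my naive MTF over symbols in order of first appearance, which is just one possible convention for first occurrences):

```python
def mtf_forward(data):
    """Naive MTF; symbols enter the table in order of first appearance,
    so a first occurrence gets rank = number of distinct symbols so far."""
    table, out = [], []
    for sym in data:
        if sym in table:
            rank = table.index(sym)
            table.pop(rank)
        else:
            rank = len(table)
        out.append(rank)
        table.insert(0, sym)
    return out

def mtf_backwards(data):
    """Number of different symbols until the NEXT occurrence of each
    symbol: reverse the input, apply MTF, reverse the result back."""
    return mtf_forward(data[::-1])[::-1]
```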

> the ending being handled by extending x with A* in any order. A* symbols are saved, in the order of their first occurrence, to be associated with the links of local frequencies. The second pass models and codes x̃. When encoding a local frequency value (from a link of local frequencies), the associated symbol is known to both the encoder and the decoder, as opposed to MTF, where rank values lose that information. The symbol context is an exponentially decaying average of its past local frequency values. For faster adaptation, A* is divided into a number of classes and x̃i is encoded by entropy coding a) its class, by order-1 context modeling using its associated symbol context, and b) the remainder, if needed, by order-0 context modeling with one model
It sounds like the remaining steps can be considered a post-processor for MTF (or MTF backwards, as seems to be the case). The code that I've posted here could be useful for the initial step if it uses a large alphabet. If the sequence x is just the output of the BWT, then it would be more efficient to use something else.

11. ## Explanation of the algorithms

MTF:

Convert the sequence of symbols to a sequence of integers, where each element of the sequence contains the index of the previous instance of the same symbol. This is like multiple intertwined linked lists, one list per alphabet symbol, each linking together all instances of that symbol. An example follows (positions on top, symbols in the middle, links below; * marks a first occurrence):

```
0123456789
ABCAABAABC
***0314652
```

Each element then describes an interval: the interval between consecutive instances of the same symbol. The inversion vector I used counts the number of "right" inversions, that is, the number of smaller elements to the right of the current element. For some interval X, the right inversions of X are the intervals that end to the right of X and start to the left of X, that is, the intervals that completely enclose it. These correspond to symbols that don't appear in X, and there can be at most one of these per alphabet symbol. So the inversion vector gives, for each element, the number of different symbols that haven't been seen since the last instance of that symbol. From this number, you can easily get the MTF value: take the alphabet size and subtract the inversion count, plus one more for the current symbol:

MTF[i] = Alphabet_size - Right_inversions[i] - 1
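That identity can be sanity-checked on the example sequence above with a brute-force script (my own Python; first occurrences are skipped, since their MTF value depends on the initial-table convention):

```python
def prev_links(seq):
    last, out = {}, []
    for i, sym in enumerate(seq):
        out.append(last.get(sym, -1))   # -1 plays the role of '*'
        last[sym] = i
    return out

def right_inversions(seq):
    # O(n^2) brute force, good enough for a check
    return [sum(1 for j in range(i + 1, len(seq)) if seq[j] < seq[i])
            for i in range(len(seq))]

def mtf_naive(seq, i):
    # distinct symbols seen since the previous occurrence of seq[i]
    p = max(j for j in range(i) if seq[j] == seq[i])
    return len(set(seq[p + 1:i]))

S = list("ABCAABAABC")
L = prev_links(S)
inv = right_inversions(L)
m = len(set(S))                          # alphabet size, here 3
for i, link in enumerate(L):
    if link != -1:                       # skip first occurrences
        assert m - inv[i] - 1 == mtf_naive(S, i)
```

On this particular example the identity holds at every non-first position; symbols that never recur after position i need extra care in general.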

DC:

Create an integer sequence just as in MTF, except that each element contains the index of the next instance of the current symbol. Find the inversion vector for that sequence, as with MTF: the one that counts right inversions. Now, for some interval X, the right inversions of X are the intervals that start to the right of X and end inside X, that is, the intervals that are completely enclosed by X. This number, the number of intervals enclosed by X, is the number of symbols in X whose next instance is also in X. This is essentially the DC value for X. It excludes the intervals that start in X and end outside of X, of which there is exactly one per different alphabet symbol in X, i.e. the MTF value (this suggests another way to compute MTF). To get the canonical DC sequence, you have to tidy up the sequence a bit. The first thing to notice is that there are lots of extra zeroes: each same-symbol run will be filled with zeroes, and since DC encodes runs implicitly, these zeroes need to be omitted. For the remaining steps, please refer to my other posts in this thread and the source code.

12. Hi, as for the seg fault on enwik8, I got this working after the byte-to-int conversion you included, thanks for the PM.

I tried it now on enwik8; it was about 4 times slower than my MTF, but maybe just because the sort algorithm is not yet ideal.

13. > Originally Posted by quadro
> Hi, as for the seg fault on enwik8, I got this working after the byte-to-int conversion you included, thanks for the PM.
>
> I tried it now on enwik8; it was about 4 times slower than my MTF, but maybe just because the sort algorithm is not yet ideal.
I'm glad you got it working. Thanks for posting that.

It's not hard to beat this implementation for the BWT on bytes. The implementation in openbwt is 6 times faster on enwik8.

In order to see a benefit from this approach, the sequence must have a large alphabet.

14. This also seems to be faster on random bytes. This algorithm has better worst-case behavior than the one in openbwt.

This isn't to say that it couldn't be made faster. But I'm not sure it will ever be the right algorithm for the BWT, especially since MTF is not likely to be the bottleneck when using the BWT.
