Results 1 to 30 of 30

Thread: The BWT is redundant...

  1. #1
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts

    The BWT is redundant...

    I've been trying to come to a thorough understanding of the BWT and suffix arrays. This is a long term project. Conceptually, they are fairly simple, but there are a lot of details.

    There are two basic sources of information: academic literature and forums like this. The academic literature is rigorous and well-documented. This forum has contributors, for instance BWTS, with great ideas. In one case, too much jargon and formalisms can get in the way, in the other case, too little formality is the issue. In one case, the articles are easy to find, but aren't always free, in the other case, the articles are scattered in obscure locations, or don't exist at all. I try to read both.

    Here's some information that I've found on the redundancy/waste in suffix arrays and some observations I've made on redundancy in the BWT. It might not be new, but it's not exactly easy to find. This forum seems to cover mostly advanced statistical models in compression, but I'm more interested in discrete math and this seems like the right place to talk BWT.

    Suffix arrays:

    I've seen some papers in the academic literature (here, here) that deal with counting the number of suffix arrays, and defining the subset of permutations of n that are valid suffix arrays (over a fixed alphabet, with various constraints). It's not hard to see that, for the usual case of files over byte alphabets with n >> 256, not all permutations of n (=length) can be distinct suffix arrays; in fact, most can't. The easiest way to see this is that the number of general permutations of n is n!, while the number of files of n bytes is 256^n. Since each file maps to a single suffix array, and the number of files is smaller than the number of permutations of n (for increasing n), there must be permutations that are not suffix arrays. Another way to see this is that there is a correspondence between suffix arrays and BWTs. Assuming that a BWT is unique to a particular file, then a BWT must correspond to a single (not necessarily unique) suffix array. In general, the number of distinct BWTs is an upper bound on the number of suffix arrays. BWTs of sufficient length always contain repeated characters, and since repeats are indistinguishable, the number of distinct suffix arrays of a file in which some character c repeats k times must be limited by n!/k!. So that gives insight into a tighter bound on the number of suffix arrays: the number is not nearly the number of general permutations of n, it's bounded by the much smaller number of ways to arrange a multiset (set with repeats) of size n.

    The BWT:

    Considering that there are permutations of n that are not valid suffix arrays, I wondered if there are arrangements of characters that are not valid BWTs. The answer is yes. By the way the BWT is defined and used, a BWT must describe a permutation with exactly one cycle. Walking all the way through that cycle yields back the original file. But not every arrangement of characters, when treated as a BWT, describes a single cycle. Consider this example, with the BWT on the left and the sorted sequence on the right:

    c . a
    b . b
    a . c

    This "BWT" contains two cycles. Starting at c in the first row, you can never get to b in the second row. Starting at b, you keep coming back to the same place and never get to either a or c. So the sequence "cba" can't be a valid BWT.

    There are lots of invalid BWTs. No sorted sequence of characters can be a BWT:

    a . a
    a . a
    b . b

    In general, a sorted sequence, treated as a BWT, will contain as many cycles as letters.

    Since the number of BWTs is strictly smaller than the number of ways to arrange the same letters/characters in an arbitrary sequence, the BWT must be redundant. It turns out that, at least in some cases, not all the letters are necessary. The following sequence is drawn from the 3-letter alphabet abc:

    b
    a
    ?

    The missing letter in the bottom row has to be "a." Either b or c would make the last letter a self-cycle.

    The loss in information going from a general sequence with arbitrarily many cycles to a single cycle would generally be a lot, but I think much of the difference gets absorbed by the repeating characters, i.e., not all partitionings into cycles would be possible in reality.

    It seems intuitively clear that if the BWT was a 1:1 mapping from a sequence to a permutation of that sequence (i.e. a bijection), for all sequences, the information would be perfectly balanced. The permutation would neither add nor subtract information; it would have no more redundancy than the source. The BWT loses information about the starting position, which has to be resolved by adding something external. The amount of information in the start position must account for the lost/redundant information in the BWT itself. The start position adds at most lg n bits (actually, somewhat more, since the number needs its own encoding). The requirement that the BWT sequence must be a single cycle limits where it can end; for instance, it can't end with the largest character in the sequence. So it encodes within it some information on its own length. But it's not self-terminating in the general case. So you can't eliminate the length information, either.

    Scott, of the BWTS, has dealt with the non-bijectivity of the BWT, and has a solution. The BWTS is looking good to me as perhaps a better replacement for the BWT. What I have not come across is a lot of documentation, and I'm not sure whether or not the bijection has received a formal proof. I'd like to refine what I know about the BWTS, and, maybe, the proof is not that hard.
    Last edited by nburns; 27th March 2013 at 21:55. Reason: Revised

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    I've been trying to come to a thorough understanding of the BWT and suffix arrays. This is a long term project. Conceptually, they are fairly simple, but there are a lot of details.

    ....

    Scott, of the BWTS, has dealt with the non-bijectivity of the BWT, and has a solution. The BWTS is looking good to me as perhaps a better replacement for the BWT. What I have not come across is a lot of documentation, and I'm not sure whether or not the bijection has received a formal proof. I'd like to refine what I know about the BWTS, and, maybe, the proof is not that hard.
    I would say that the German guy did a currently blessed proof Prague Stringology Conference, pp. 65?69
    which I think is at http://arxiv.org/abs/0908.0239
    Also the one Gil and I did is there not sure it's totally cleaned up but it's here http://arxiv.org/abs/1201.3077

    Saw it used by a Chinese guy a Japanese guy wrote faster code on this site. So I think only people
    at this site have used it. Oh and saw a Garfield cartoon thing where it was used. I would speculate that
    the Russian or Chinese intelligence agencies use some binary version of it pre pass to encryption since
    that may be one of its best uses. But its highly unlikely that it would be used by the Americans or British
    since they already think they know everything. But that's just my view of world events.

    Also I did at one time the slow reverse way that shows how to do the reverse BWTS of any string. It's
    that process alone that completely defines the relationship between the forward and reverse BWTS in
    my mind making it self evident that its bijective and unique. You don't need to know about lyndon words
    to see this.

    When using the binary version it's best to tack a ZERO if front and a ONE in back and then drop
    the ONE in front and ZERO in back after the BWTS. Again that's just my view.

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I wrote a simple description of BWTS with an example here: http://mattmahoney.net/dc/dce.html#Section_559

    There is not much difference in compression compared to BWT. With BWTS you don't need to code the starting index, which saves a few bytes. However a BWTS requires splitting the input into Lyndon words (blocks that lexicographically precede any of its rotations) before context sorting. The contexts wrap around at the Lyndon word boundaries, so some of the high order context is lost there. However, there are usually not a lot of Lyndon words so the loss is small. The difference could go either way.

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Matt I like your example for BANANA you form the Lyndon words B AN AN A you break it into 4 words. However if you have repeated Lyndon words you can break the string into fewer strings by combining repeated Lyndon words. It also is a unique splitting of the string and in this case you get B ANAN A You will get the same BWTS using either set but depending on how you write your code one way or the other my be faster.

    Matt in your example of BWT versus BWTS you calculate the True BWT for BANANA which is (NNBAAA,2) Which is correct, Yet
    in section 5.5 you use what people usually use instead of BWT which gives "ANNB$AA" which could be written as (ANNBAA,4) which compares to mine of ANNBAA except for the index. Your big test compares BWTS not to BWT but to something of the form in 5.5 which though close is not BWT why not test all 3 ways. I know why people don't use real BWT its slow but you could test it in your example.

  5. #5
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    I would say that the German guy did a currently blessed proof Prague Stringology Conference, pp. 65?69
    which I think is at http://arxiv.org/abs/0908.0239
    Also the one Gil and I did is there not sure it's totally cleaned up but it's here http://arxiv.org/abs/1201.3077
    I'm glad you replied. I think I've seen both of those papers before. The one by Kufleitner is a bit long on notation, short on explanation. All of the information is probably in there, but I'd need to spend some time looking up the definition of each symbol, before I could even be sure exactly what claims are being made. I don't really doubt that it's bijective, but I'd like to gain more insight into it. The other paper explains things better in plain english, but I'd need to spend more time there, too. There seem to be a lot of different ways to approach this topic, and each has different names for things. The auxiliary permutation, for instance, is probably the same thing that other papers call the "last to first" function or LF (the papers I read use small words, heh), but I'd like to confirm that. It could also be the inverse.

    Ever since I first read about the BWTS, I wondered if there might be an easier way to get there, through a single suffix sort. I haven't been able to reach a definitive answer, but it still looks promising to me. Given the inverse suffix array*, if you were to recursively find the smallest element in the initial portion of the array, mark that point, and then cut off the "tail," would that give you the Lyndon factorization? Then, I suspect it may be possible to use the inverse suffix array of the whole file to figure out what the merger of all the individual cycles would look like. If you imagine working backward from the smallest Lyndon word, every time you cut off the end and form a new cycle out of the next-to-last word, the new last element (i.e. in the inverse suffix array) is followed by the smallest element in the sequence, both before and after. So, that should not change the sort order up to that point. Unfortunately, it looks like it can change the order after that point, by causing a few elements at the end to get bigger, possibly sending them lower in the sort. But, I'm thinking, optimist that I am, that you could do the rearrangements in a final pass over the array, without having to do anything hard. If everything worked out, you'd save the initial Lyndon factorization, and one suffix sort over the whole sequence would be practically all of the sorting it would need. It says in the paper you co-wrote that in a lot of cases, the BWTS comes out the same as the BWT except for a few small changes. It stands to reason that the algorithm might not be all that different, either. Have I missed something? Or do you think there's a chance?

    * Since terminology varies: by this I mean the array that contains the absolute sort index for every position in the file.

  6. #6
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I wrote a simple description of BWTS with an example here: http://mattmahoney.net/dc/dce.html#Section_559
    Thanks.

    There is not much difference in compression compared to BWT. With BWTS you don't need to code the starting index, which saves a few bytes. However a BWTS requires splitting the input into Lyndon words (blocks that lexicographically precede any of its rotations) before context sorting. The contexts wrap around at the Lyndon word boundaries, so some of the high order context is lost there. However, there are usually not a lot of Lyndon words so the loss is small. The difference could go either way.
    There are some empirical results that show a net improvement. It's mentioned in Scott's paper. I have a suspicion that the Lyndon words never break across meaningful contexts, and that there must be a straightforward explanation. Alas, I still don't have the explanation.

  7. #7
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    Thanks.



    There are some empirical results that show a net improvement. It's mentioned in Scott's paper. I have a suspicion that the Lyndon words never break across meaningful contexts, and that there must be a straightforward explanation. Alas, I still don't have the explanation.

    I have to agree with Matt the savings is really really small. The advantage is that is bijective. One other thing the savings in the paper I wrote with Gil ( actually he wrote it. I can't write worth shit.) I took an example from the web site Mark Nelson let me have where I posted in compared to Daneil A. Nagy's 2 state binary BWT. I never had the code Nagy used the numbers where taken from his paper. I decided to do the 2 state binary since there are fewer follow on methods to use. I tried to do simple follow on stages and then compared the results to what he published. It's possible he used poorer second stage methods. So the savings where mine look better than they should have if we used the same follow on routines. Matts comparsion is a far better example of the two methods.

    There is an other effect where sometimes its best to view a file in the reverse direction this can change the results. Also the fact many files are not stationary chopping the file up into several segments can improve compression with several BWT and several indexes. Of course you could save on the indexes by using BWTS.

    I never met Gil or Kufleitner, One lives in Israel the other Germany. I was going to go to Prague and Israel where I could have meet both. But I am an old poor guy with bad health and my wife got scared so never did. I dream far more than I code and am not good at explanations. I had hoped to met others I wish I could see Matt and Mark and Shelwien and even one of the RSA guys that Gil knew but that is all getting farther and farther away. I do worry about the future of mankind I think it is bleak.

  8. #8
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    Matt I like your example for BANANA you form the Lyndon words B AN AN A you break it into 4 words. However if you have repeated Lyndon words you can break the string into fewer strings by combining repeated Lyndon words. It also is a unique splitting of the string and in this case you get B ANAN A You will get the same BWTS using either set but depending on how you write your code one way or the other my be faster.
    I think B AN AN A is the true Lyndon factorization. According to the link below, and some other pages I've seen, Lyndon words must be aperiodic. Lyndon words are a special case of necklaces. B ANAN A is apparently called the necklace factorization (see).

    This site has a Lyndon factorization widget that's fun to play with.
    http://www.allisons.org/ll/AlgDS/Strings/Factors/

    I "borrowed" the javascript from that page and translated the factorization algorithm into into C, which is attached.*

    * Edit: see the version in my next post, which treats all bytes as unsigned
    Last edited by nburns; 4th April 2013 at 19:30.

  9. #9
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    I think B AN AN A is the true Lyndon factorization. According to the link below, and some other pages I've seen, Lyndon words must be aperiodic. Lyndon words are a special case of necklaces. B ANAN A is apparently called the necklace factorization (see).

    This site has a Lyndon factorization widget that's fun to play with.
    http://www.allisons.org/ll/AlgDS/Strings/Factors/

    I "borrowed" the javascript from that page and translated the factorization algorithm into into C, which is attached.
    Of course B AN AN A is the unique factorization into Lyndon words for BANANA
    I don't know what you call B ANAN A but either grouping can be used to calculate the same BWTS.
    The only difference between the two is that you group together identical Lyndon words when you do a lyndon break up and identical lyndon words can only occur together in such a beak up.

  10. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    Of course B AN AN A is the unique factorization into Lyndon words for BANANA
    I don't know what you call B ANAN A but either grouping can be used to calculate the same BWTS.
    Oh, I misread. ... The canonical BWTS construction, so to speak, uses the Lyndon factorization, but you're saying that you can substitute the other factorization without changing the output sequence. If it's faster, you're probably talking about when you pass it to the sort. I think you could get even cleverer and only sort one of the copies. Then you'd just repeat each letter the right number of times when writing out the sequence. But repeating Lyndon words isn't something that happens a lot in real data -- at least, not anything I've tried. My port of the Lyndon algorithm to C tells me that the King James Bible (800k English words) breaks into 13 Lyndon words (I'm guessing they don't happen to be huge duplicated sections). So, barring some other data set that justifies it, it's probably not worth optimizing.

    By the way, I attached a new version of the Lyndon algorithm. This one handles characters > 127 as unsigned.
    Attached Files Attached Files

  11. #11
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I played around some more, and it does seem to be true that the Lyndon factorization is easily contained within the suffix array, and the algorithm I described works to retrieve it. It requires you to add a sentinel at the end of the file before you sort.

    I can see why the BWTS doesn't affect compression much. The number of factors is too small to be significant:

    It's probably not too bad of an approximation to treat the Lyndon word breaks as being random. The first one occurs about halfway through the file, on average. Then the second one occurs about halfway through the first half, and so on. So the number of words is approximately lg(n).

    Being bijective clearly makes it better. But I couldn't come up with a simple way to get the BWTS directly from the suffix array, without doing more sorting. Since the Lyndon factors spill out as a byproduct of the sort, maybe the way to go is to change the suffix sort to generate the BWTS sequence directly. It probably wouldn't slow down the sort, and it's probably worth doing -- by somebody other than me.

    The problem is that the tail ends of the Lyndon words might need to change places if they have matches in other Lyndon words nearer the end. The lengths of these tails is potentially unlimited, so I'm afraid the fix-up on the suffix array would degenerate into redoing a lot of the work of suffix sorting.

    The BWT/S is basically an intermediate representation, anyway. You're going to do more processing on it, so I think you're right that the difference isn't super important.

    Another bijective (apparently) transform, which goes from n bytes <-> n bytes, is the wavelet tree. Have you used this? It seems like a natural complement to the BWT. It's basically the same process as the BWT, as applied to bits, except that it doesn't merge all the bits together. In other words, you have all the 0-bits, then all the 1-bits, etc. They can be concatenated together to make one sequence. It does a nice job after the BWT.
    Last edited by nburns; 31st March 2013 at 07:08.

  12. #12
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I know people have tried applying the BWTS recursively and there was no benefit. This makes a certain amount of sense, since the BWT(S) is already in some sense optimal.

    But, I wonder if anyone has tried applying MTF to the output of the BWTS before the recursive step. This is a little different, because you've got a brand new sequence. The high entropy and low entropy parts both tend to occur in runs*, and MTF translates the high entropy parts into large numbers and the low entropy parts into small numbers. The high numbers would sort to one end, so you might be able to profit from the correlations independent of what letter they came from. MTF has the nice property that there are exactly as many distinct output values as distinct input letters. I'm going to try it eventually.

    Along similar lines, I've got another bijective transform in mind:

    Visit the original sequence backwards, sorting and generating the BWT as you go. For each symbol of the sequence, figure out the MTF value, as you would for the BWT, and output that.

    Note that you sort *as you go,* and you output the values as you generate them, so they come out in the original order (backwards).

    Now, since the MTF values number the same as the original symbols, map them reversibly onto the original set of symbols and substitute those for the numbers.

    It's sort of the unsorted BWT... except that the frequencies of the characters change; the distribution becomes highly skewed. It's a relabeling rather than a permutation.

    The proof that it works is inductive: each step is fully determined by the current symbol and the ones processed ahead of it.

    * If run is the right word. Numbers larger than 1 tend to clump, but of course the values are random.
    Last edited by nburns; 2nd April 2013 at 13:22. Reason: clarify

  13. #13
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Me: "The loss in information going from a general sequence with arbitrarily many cycles to a single cycle would generally be a lot..."

    Actually, I was wrong about that. The number of general permutations (any number of cycles) is n!, and the number of permutations that are one big cycle is (n - 1)!. Of course, we're talking about BWTs, not general permutations, but it appears to be the same, because...

    When the original sequence is not a power of some smaller word, each BWT position is unique. Hence, the position is all you need to get back to an arbitrary sequence. For powers, each position is not unique. I'm actually not sure how the BWT handles this. Is there a special rule?
    Last edited by nburns; 4th April 2013 at 09:41.

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    If a BWT has n identical loops then traversing one loop n times does the correct inverse transform. It doesn't require a check for this special case as long as you check by counting output bytes rather than check for returning to the starting point.

  15. #15
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Matt Mahoney View Post
    If a BWT has n identical loops then traversing one loop n times does the correct inverse transform. It doesn't require a check for this special case as long as you check by counting output bytes rather than check for returning to the starting point.
    Yes the REAL BWT does this you when have a source file that is composed of only a set of repeated fragments. If the file is made up of exactly 5 repeated identical fragments then there will be 5 repeated loops when doing the standard inverse transform for the REAL BWT

    However virtually nobody uses the REAL BWT those use a corrupted version that includes a unique symbol of special weight for End of String or End of File. This prevents repeated loops occurring during the so called BWT that most people use when they pretend they are using a BWT.

    The main reason nobody uses the REAL BWT is that using the corrupted version where you tack on special unique symbol at end is fast to run on computers using current methods. Of course it doesn't return the same answer as a REAL BWT and I guess few use BWTS because it currently is slower than the corrupted versions in current use.

    If a file when decomposed contains only repeated lyndon words the the REAL BWT and the BWTS Transform are identical.

  16. #16
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    If a BWT has n identical loops then traversing one loop n times does the correct inverse transform. It doesn't require a check for this special case as long as you check by counting output bytes rather than check for returning to the starting point.
    That's an excellent point. Thanks.

    If you add a sentinel at the end, how do you ensure that the sentinel is unique?

  17. #17
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    Yes the REAL BWT does this you when have a source file that is composed of only a set of repeated fragments. If the file is made up of exactly 5 repeated identical fragments then there will be 5 repeated loops when doing the standard inverse transform for the REAL BWT
    This version seems to be what suffix sorts like divsufsort use.

    However virtually nobody uses the REAL BWT those use a corrupted version that includes a unique symbol of special weight for End of String or End of File. This prevents repeated loops occurring during the so called BWT that most people use when they pretend they are using a BWT.

    The main reason nobody uses the REAL BWT is that using the corrupted version where you tack on special unique symbol at end is fast to run on computers using current methods. Of course it doesn't return the same answer as a REAL BWT and I guess few use BWTS because it currently is slower than the corrupted versions in current use.
    The alternate version is the real BWT, it's just the BWT of File+sentinel instead of File alone.

    Adding a zero at the end makes it more like the BWTS. The last symbol sorts first in both cases. It makes the sequence more predictable and better behaved. What you really want is a complete solution, like the BWTS, but, like you said, there are obstacles. I proposed that the BWTS be built directly into the suffix sort, so it sorts its input directly into the BWTS.

    If more things were proved about the BWTS, it might get a little more attention. For example: the missing information in the BWT is O(lg n). It looks like the BWT -> BWTS changes the positions of O(lg n) symbols. So, somehow, the information goes directly into the transpositions. It would also help if the BWTS had a fast algorithm. I think a fast algorithm is possible.

  18. #18
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by nburns View Post
    That's an excellent point. Thanks.

    If you add a sentinel at the end, how do you ensure that the sentinel is unique?
    You only have to remember its postion when you do compare. The actual sentinel is never stored but you know where it is because of the index. You then give it a unqiue value larger or smaller than all the symbols that are really used.

  19. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    An alternative is to use a sequence that goes in the original order, but retains the compressibility benefits of the BWT(S). I proposed one in #12. Any comments/critiques? In practice, it would be a lot like PPM, so it isn't necessarily groundbreaking, but the fact that it's a bijection in the spirit of the BWT(S) makes it sort of theoretically interesting.

  20. #20
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    You only have to remember its postion when you do compare. The actual sentinel is never stored but you know where it is because of the index. You then give it a unqiue value larger or smaller than all the symbols that are really used.
    I guess you're talking about if you do your own suffix sort. Someday I might write my own sort, but so far I outsource this part to libdivsufsort, which is easy and fast.

  21. #21
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Nburns actaully Yuta has code quite fast for the BWTS. Buts it not clear to me that you understand the difference bwteen the 3 types of Transforms. Suppose your file a file that is exactly 6 bytes long "ABCABC" What is the REAL BWT.
    Then caclulate the BWTS and then do what 99.9% of programs that claim to do a BWT of a whole file. What do you get??

    Here to help is what I get and I am not to concerned with the index values you get using the two common methods since many ways to define the index.

    Method for REAL BWT sort "ABCABC" use last column
    ABCABC
    ABCABC
    BCABCA
    BCABCA
    CABCAB
    CABCAB You get "CCAABB" with index 0

    know do BWTS
    ABC
    ABC
    BCA
    BCA
    CAB
    CAB You get "CCAABB" but no index compare to real BWT anwser

    know do BWT to string plus special heavy sentinal that most people use

    ABCABC$
    ABC$ABC
    BCABC$A
    BC$ABCA
    CABC$AB
    C$ABCAB
    $ABCABC You get "CAABBC" index 0 you drop the temporary unique sentinal

    What do you get??

  22. #22
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    Nburns actaully Yuta has code quite fast for the BWTS.
    I didn't know about that. I think I just found it on the wikipedia page for the BWT. Thanks.

    know do BWT to string plus special heavy sentinal that most people use

    ABCABC$
    ABC$ABC
    BCABC$A
    BC$ABCA
    CABC$AB
    C$ABCAB
    $ABCABC You get "CAABBC" index 0 you drop the temporary unique sentinal

    What do you get??
    Please do let me know when I say something dumb. But I think here the issue is different conventions for the sentinel method. When I encountered this method, it was using a sentinel that sorts first. Matt describes the same thing in Data Compression Explained:

    Recall that a suffix array (SA) is a list of pointers to the suffixes of a string in lexicographical order. Thus, the SA for "BANANA$" is (6,5,3,1,0,4,2), where "$" precedes all other characters. The BWT is calculated by outputting the characters to the left of each of the pointers, e.g. "ANNB$AA".
    If you apply that to ABCABC, you'd get "CC$AABB". If you leave in the sentinel, then you can omit the index. The last non-sentinel symbol winds up first, as it does in the BWTS.

    Assigning the sentinel to sort last is equally valid. I've even seen it placed in the middle.

    Is that it? Or am I missing something else?

  23. #23
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    your missing a lot of something not sure I can help you!

  24. #24
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by biject.bwts View Post
    your missing a lot of something not sure I can help you!
    Are you being serious? Could you at least point me to toward it? It looks like the sentinel method you're talking about is a little different than what I've encountered. You said the sentinel gets dropped. Is it a lot different from the sentinel method I know of?

    I apologize if something I wrote came out wrong. Was it when I wrote that the sentinel method is the BWT of the File+Sentinel? I consider you an expert. I'm basically a newcomer to compression, and especially this board. I know just enough to be dangerous. I was hoping to basically think out loud and maybe get some feedback. Thanks for the info you've already posted here.

    I've tried thinking of other ways to permute a sequence and get something bijective like the BWTS, and I came up empty -- it looks like the BWTS might be the unique solution. It's pretty sweet that you nailed it. I'm wondering where the idea to use Lyndon words came from? I had not heard of them before I encountered the BWTS. I know very little of combinatorics and group theory, but there seems to be a huge body of theoretical work out there. In this paper, Crochemore references a bijection due to Gessel and Reutenauer, but it's a bit over my head. Crochemore seems to be describing something similar to the BWTS, but I can't grasp enough to know if it's the same.
    Last edited by nburns; 5th April 2013 at 02:54.

  25. #25
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Look the index values and so called sentinals can be done several ways. I have seen some people use .5 for the EOF symbol in binary files.

    files "ABC" "BCA" "CAB" "CBA" "ACB" "BAC"


    BWT in same order as above using indes 0,1,2 as start row of orginal string is moved to
    "0,CAB" "2,CAB" "1,CAB" "1,BCA" "2,BCA" "0,BCA"

    Now BWT + HIGH VALUE SENTINAL same input files using $ for SENTINAL
    "$ABC" "C$BA" "CA$B" "BC$A" "$ACB" "B$AC"

    Now BWT + LOW VALUE SENTINAL
    "C$AB" "AC$B" "BCA$" "ABC$" "B$AC" "CB$A"

    Now BWTS
    "CAB" "ACB" "BAC" "ABC" "BCA" "CBA"

    This example not typical but BWTS works on all files both directions.
    But look at "A$BC" no reverse.

    If errors I will change by editing its late and had a few german beers

  26. #26
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Here is how to compress many files shorter using current BWT or BWT plus Sentinel Where the sentinel big small or in between.

    Long before I settled on which bijective transform to call BWTS I was troubled by the INDEX field needed with standard BWT. So I decided to bijective combine the offset or pointer to a location in the data bijectively. Many falsely assume that using some sort of universal number does the same thing but it clearly does not. You just have to open your minds and think.

    Back at least a decade ago I have method on http://bijective.dogma.net/ where I do this combining to give a better result then current methods based on the following.

    Think of BWT has giving a transformed set of data with an index field
    where you compress the data and pass index as final compressed product.

    If you compress such that two data files that differ only by one being longer
    than the other and when compression occurs the shorter data file does not compress to a longer length than the longer data set. Then using my combiner you will always get the same or better compression than not using it.

    The example I give is from the web site. (thank you Mark Nelson for not deleting it) There is a file 137 bytes long I do a BWT marks modifed code and get a file 141 bytes long. That file really consists of a 137 byte BWT and an index 4 bytes long but could easly have been written as a one byte index and the 137 bytes of transformed data. Next do by bijective joining of the file and index you get a file that is 137 bytes long in total. Then if we use a one byte filed since that is the shortest possible it becomes one byte of whatever followed by 136 bytes of data that exactly match the data from the nonbijective joined form above. Note if one normally used 4 byes for index you would then have a 133 set of data to compress.

    I am showing you this since many think there is no difference in compression using bijective BWT versus some so called standard. And the truth is its very small and often BWT would beat BWTS. However using the method of BWT that Matt or others used you can bijectively combine the index with the data. All this really does is that it leaves identical to what they had except it may be a few bytes shorter but never longer. (again assuming shorter file compresses same or smaller than longer file that has shorter as front part).

    There is another advantage of using this. Say you are doing some sort of BWT compression thing followed by encryption. If the enemy never notices that you changed the way BWT is handled. A fast check for the enemy is to see if the index is in range of the data when a key or set of keys is tested. If you use my old way that check would have been useless.

    Note the code would need to be recompiled to be used by you since written for DJGPP GNU DOS so good luck. Also it would have to be extended to use with files that are much longer today.

  27. #27
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Thanks for replying. I can't quite follow everything you wrote -- I'll have to come back to it. I'm interested in understanding the combining method. The way I conceptualize the BWTS is that, rather than append a sentinel that sorts before everything else, instead you just find the place that naturally sorts before everything else and attach that to the end. Then you do that recursively. I'd like to understand it from other perspectives, too.

    On the link you posted, I'd like to read more about the bijective LZW.

    I am showing you this since many think there is no difference in compression using bijective BWT versus some so called standard. And the truth is its very small and often BWT would beat BWTS.
    My thought is that the BWT and BWTS are separated by O(lg n) differences... at some point, I'm going to look into this more.

    I like the fact that it doesn't require metadata or have any excess arbitrary parameters. It has a simple API. You don't have to choose how many bits to give the index, or bother encoding the index with some universal code.

  28. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    BBB uses "real" BWT without a sentinel. I didn't have to do it that way, but at the time it seemed easier. Of course it is more efficient to stop the comparison at the end of the block instead of wrapping around. It would also have been more efficient to use something like libdivsufsort instead of quicksort, which is really slow if there are long matching strings in the input.

  29. #29
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Matt Mahoney View Post
    BBB uses "real" BWT without a sentinel. I didn't have to do it that way, but at the time it seemed easier. Of course it is more efficient to stop the comparison at the end of the block instead of wrapping around. It would also have been more efficient to use something like libdivsufsort instead of quicksort, which is really slow if there are long matching strings in the input.
    Interesting I forgot you did that. I wonder how fast it would be if you used Yuta's later methods for a much faster version. I know his is not the true BWT like you used but that or even BWTS is not really going to make more than a few dozen bytes different in compressed size. And no I am not the one to do that I have more and more I want to do everyday that will never get done.

  30. #30
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    BBB would be a lot faster with libdivsufsort but I never got around to rewriting it. I did make BWT one the compression methods in zpaq (-method 3), but I didn't incorporate the slow (low memory) modes. Jan Ondrus modified BBB as a zpaq preprocessor and rewrote the decompression algorithm in ZPAQL as a config file. It's called bwt_slowmode1.zip at http://mattmahoney.net/dc/zpaq.html

Similar Threads

  1. Why does BWT work on images?
    By m^2 in forum Data Compression
    Replies: 6
    Last Post: 21st September 2012, 03:46
  2. BWT alphabet reordering
    By Matt Mahoney in forum Data Compression
    Replies: 9
    Last Post: 3rd September 2011, 23:49
  3. Alphabet Reordering for BWT
    By Shelwien in forum Data Compression
    Replies: 3
    Last Post: 31st May 2009, 03:32
  4. BWT - how to use?
    By m^2 in forum Data Compression
    Replies: 29
    Last Post: 6th November 2008, 03:01
  5. BWT/ST compression
    By encode in forum Data Compression
    Replies: 60
    Last Post: 25th June 2008, 07:21

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •