I've been trying to come to a thorough understanding of the BWT and suffix arrays. This is a long term project. Conceptually, they are fairly simple, but there are a lot of details.
There are two basic sources of information: academic literature and forums like this. The academic literature is rigorous and well-documented. This forum has contributors, for instance BWTS, with great ideas. In one case, too much jargon and formalisms can get in the way, in the other case, too little formality is the issue. In one case, the articles are easy to find, but aren't always free, in the other case, the articles are scattered in obscure locations, or don't exist at all. I try to read both.
Here's some information that I've found on the redundancy/waste in suffix arrays and some observations I've made on redundancy in the BWT. It might not be new, but it's not exactly easy to find. This forum seems to cover mostly advanced statistical models in compression, but I'm more interested in discrete math and this seems like the right place to talk BWT.
I've seen some papers in the academic literature (here, here) that deal with counting the number of suffix arrays, and defining the subset of permutations of n that are valid suffix arrays (over a fixed alphabet, with various constraints). It's not hard to see that, for the usual case of files over byte alphabets with n >> 256, not all permutations of n (=length) can be distinct suffix arrays; in fact, most can't. The easiest way to see this is that the number of general permutations of n is n!, while the number of files of n bytes is 256^n. Since each file maps to a single suffix array, and the number of files is smaller than the number of permutations of n (for increasing n), there must be permutations that are not suffix arrays. Another way to see this is that there is a correspondence between suffix arrays and BWTs. Assuming that a BWT is unique to a particular file, then a BWT must correspond to a single (not necessarily unique) suffix array. In general, the number of distinct BWTs is an upper bound on the number of suffix arrays. BWTs of sufficient length always contain repeated characters, and since repeats are indistinguishable, the number of distinct suffix arrays of a file in which some character c repeats k times must be limited by n!/k!. So that gives insight into a tighter bound on the number of suffix arrays: the number is not nearly the number of general permutations of n, it's bounded by the much smaller number of ways to arrange a multiset (set with repeats) of size n.
Considering that there are permutations of n that are not valid suffix arrays, I wondered if there are arrangements of characters that are not valid BWTs. The answer is yes. By the way the BWT is defined and used, a BWT must describe a permutation with exactly one cycle. Walking all the way through that cycle yields back the original file. But not every arrangement of characters, when treated as a BWT, describes a single cycle. Consider this example, with the BWT on the left and the sorted sequence on the right:
c . a
b . b
a . c
This "BWT" contains two cycles. Starting at c in the first row, you can never get to b in the second row. Starting at b, you keep coming back to the same place and never get to either a or c. So the sequence "cba" can't be a valid BWT.
There are lots of invalid BWTs. No sorted sequence of characters can be a BWT:
a . a
a . a
b . b
In general, a sorted sequence, treated as a BWT, will contain as many cycles as letters.
Since the number of BWTs is strictly smaller than the number of ways to arrange the same letters/characters in an arbitrary sequence, the BWT must be redundant. It turns out that, at least in some cases, not all the letters are necessary. The following sequence is drawn from the 3-letter alphabet abc:
The missing letter in the bottom row has to be "a." Either b or c would make the last letter a self-cycle.
The loss in information going from a general sequence with arbitrarily many cycles to a single cycle would generally be a lot, but I think much of the difference gets absorbed by the repeating characters, i.e., not all partitionings into cycles would be possible in reality.
It seems intuitively clear that if the BWT was a 1:1 mapping from a sequence to a permutation of that sequence (i.e. a bijection), for all sequences, the information would be perfectly balanced. The permutation would neither add nor subtract information; it would have no more redundancy than the source. The BWT loses information about the starting position, which has to be resolved by adding something external. The amount of information in the start position must account for the lost/redundant information in the BWT itself. The start position adds at most lg n bits (actually, somewhat more, since the number needs its own encoding). The requirement that the BWT sequence must be a single cycle limits where it can end; for instance, it can't end with the largest character in the sequence. So it encodes within it some information on its own length. But it's not self-terminating in the general case. So you can't eliminate the length information, either.
Scott, of the BWTS, has dealt with the non-bijectivity of the BWT, and has a solution. The BWTS is looking good to me as perhaps a better replacement for the BWT. What I have not come across is a lot of documentation, and I'm not sure whether or not the bijection has received a formal proof. I'd like to refine what I know about the BWTS, and, maybe, the proof is not that hard.