# Thread: Is the BWT an FFT-style transform?

1. ## Is the BWT an FFT-style transform?

People have observed that the BWT is similar to the Fourier transform (a.k.a. the FFT -- or, really, the DFT; I'll call it the DFT). I didn't know much about the DFT when I came across that statement, so I wasn't sure how deep the analogy went. It could resemble it only superficially, or there could be deep connections between them. Given that the DFT is based on very old, well-studied math, finding connections between it and the BWT could yield a lot of insight, and might point the way to other kinds of transforms with uses in compression, along with efficient ways to implement them.

I've been looking into the DFT lately, and I didn't think so at first, but now I think the BWT can be interpreted as being a transform in the same family as the DFT.

You can interpret data such as text as the coefficient vector of a polynomial with coefficients in a finite field of 256 elements (strictly, GF(2^8) -- the integers mod 256 form only a ring, not a field). Alternately, you can interpret it as the point values of such a polynomial. The polynomial interpretation makes it applicable to finite field versions of the DFT. Text isn't made of sine waves -- at least, not in any way I know of -- so the usual sine waves have no advantage as basis functions, but lots of functions can be used, as long as they have specific properties. I've noticed that the functions corresponding to different "frequencies" are basically the same coefficients under different permutations (this isn't always 100% true, but it is when the number of samples is prime).

The coefficient vector itself can be interpreted as an element of a finite ring, whose elements are the set of polynomials that are rotations of one another. Enumerating the elements of this ring in lexicographic order gives the same order as the BWT (specifying it that way gets away from sorting, which doesn't fit into the DFT). The symbols corresponding to the BWT can be defined as remainders from polynomial division.
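The rotation-enumeration picture above can be made concrete with a toy sketch (my own illustrative code, not a serious implementation): sorting all rotations and keeping the last symbol of each is the textbook statement of the BWT, and the appended sentinel byte is an assumption I've added to make the transform invertible.

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: enumerate all rotations of s
    in lexicographic order and keep the last symbol of each.
    Illustrative only (quadratic space); real implementations build
    the same order from a suffix array."""
    s = s + "\x00"  # sentinel, smaller than any text byte, so the
                    # transform is invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(repr(bwt("banana")))  # the last column groups similar contexts
```

Note that the sort here is exactly the lexicographic enumeration of the rotated polynomials described above; the output is a permutation of the input plus the sentinel, nothing more.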

I haven't worked out all the details of how to formally make the BWT into a Fourier transform, but there seems to be a lot to go on, and there is probably at least one way to do it. Interpreting text as polynomials can be a useful way to think about compression algorithms, with connections to ring theory and field theory. For example, the way LZ77 breaks text down into duplicated fragments is related to factoring a polynomial. There may be more systematic ways of factoring that come out of existing theory.

This is something I've been thinking about. I'll use this thread to post whatever interesting things I find.

Edit: I realized that if you enumerate the rotated polynomials in order, that's pretty much all of it. Just drop all but the last term and it's done. So that trivializes it. Any way you look at it, the BWT just permutes; it doesn't mix together all the terms like the DFT, so it's simpler. I think there's fertile ground in the space in between them, though, and it's a good place to look for ideas.

2. Originally Posted by nburns
People have observed that the BWT is similar to the Fourier Transform (aka FFT, or, really, the DFT. I'll call it the DFT).
Any sources? Because I don't see how the two are related, except on metaphorical grounds (both reduce correlation by transforming into a better basis).

Originally Posted by nburns
You can interpret data such as text as the coefficient vector of a polynomial with coefficients over a finite field mod 256.
You certainly can, but that still doesn't make a good analogy. The DFT is a source-independent transformation. It is the same linear transformation no matter what the input is. The same does not hold for BWT. The permutation that is done depends a lot on the input and is driven by it.

Well, if you get any interesting ideas out of this, please let us know.

3. The only thing I see they have in common is O(n log n) complexity (for unrelated reasons).

4. I have thought about this long and hard. I don't know if they are related, but it reminds me of this guy (see URL). I watched his speech and read his ideas about numbers and such. I can see how, in some people's minds, the BWT and DFT could be related. But it's nice that everybody does not think the same way.

http://www.nydailynews.com/life-styl...icle-1.1340923

I don't have all the URLs on this guy, but he is very interesting. He sees things and relationships that I don't find useful, but his mind does. It would be nice if he got into this field; I would like to know his thoughts.

5. Originally Posted by thorfdbg
You certainly can, but that still doesn't make a good analogy. The DFT is a source-independent transformation. It is the same linear transformation no matter what the input is. The same does not hold for the BWT. The permutation that is done depends a lot on the input and is driven by it.
I think you're right. In other words, sorting isn't linear, right?

6. Originally Posted by Matt Mahoney
The only thing I see they have in common is O(n log n) complexity (for unrelated reasons).
Suffix sorting is O(n), so the BWT should be O(n) also.

7. You're right, suffix sorting is O(n). So they are completely unrelated, I guess. Unless you take random array access to be O(log n) in hardware.

8. One thing I was thinking about is converting the input to a sparse representation, like BWT+MTF, and then transforming it via the FFT and undersampling. I think you can recover the original from a small number of samples. I think this is how compressed sensing works. It could be an alternative to arithmetic coding.

9. Let me know how that works out.

10. Originally Posted by Matt Mahoney
Let me know how that works out.
I'm not exactly sure how efficient it would be, either in terms of compression or computation, but I think it should be possible in theory. Each computed coefficient of the DFT (or one of the number-theoretic variants) is like a hash, and the full transform is like a complete set of hashes that fully describes the input. A subset of coefficients would describe many possible inputs, but they can be ordered by some constraint that corresponds to sparseness, and then there is a unique minimum. You could include enough coefficients to ensure that the correct untransformed input is the minimum. Finding that minimum from sparse samples is, as I understand it, the problem that compressed sensing solves, typically via convex optimization (L1 minimization). So I think the knowledge for how to do this is out there. I don't know enough yet, though.

I'd like to get a sense of how worthwhile it would be before charging ahead full steam, unless it turns out to be interesting and fun for its own sake.

11. The point of BWT is to bring together symbols with similar contexts so that you can compress the data with a low order, fast adapting (and low memory) compressor. I don't see how FFT would help on a signal that has only local correlations, or even on any symbolic (non-numeric) signal.

12. Originally Posted by Matt Mahoney
The point of BWT is to bring together symbols with similar contexts so that you can compress the data with a low order, fast adapting (and low memory) compressor. I don't see how FFT would help on a signal that has only local correlations, or even on any symbolic (non-numeric) signal.
There could be nothing there to find. I'm not prepared to argue persuasively otherwise.

13. ## You could write the reverse first

Since compression is, at best, based on the hope of making certain files of hopefully useful structure smaller -- and as a general compressor for all files, one can do no better than a flat copy -- you could write the reverse first. Find something that compresses nicely when FFT'd, i.e., whose FFT is small. Then look at the files that result when you uncompress and un-FFT. Take a good look, and then say you have designed a great compressor for that class of files.

14. Originally Posted by biject.bwts
Since compression is, at best, based on the hope of making certain files of hopefully useful structure smaller -- and as a general compressor for all files, one can do no better than a flat copy -- you could write the reverse first. Find something that compresses nicely when FFT'd. Then look at the files that result when you uncompress and un-FFT. Take a good look, and then say you have designed a great compressor for that class of files.

Then I'll convince the world to use those kinds of files. Good idea. I assume that if the FFT itself were good for text, someone would have noticed a long time ago. It would probably have to involve a tailored set of basis functions, or some kind of additional processing. Or a different transform entirely.

15. Originally Posted by nburns
I assume that if the FFT itself was good for text, someone else would have noticed a long time ago. It would probably have to involve a tailored set of basis functions, or some kind of additional processing. Or a different transform entirely.
Actually, it is the BWT that works well on natural text. The FFT is ideal for numerical data whose correlation matrix is position-independent.

16. Originally Posted by thorfdbg
Actually, it is the BWT that works well on natural text. FFT is ideal for numerical data whose correlation matrix is position-independent.
The goals are not quite the same, either. The BWT was designed for compression, and the FFT measures the contributions of a set of functions. The BWT doesn't really measure anything.

One thing from DSP that seems relevant is autocorrelation. I don't know how to use it for compression, but it doesn't seem that far removed.

17. Maybe do FFT on statistics as an enhancement of Context Mixing? :P

18. Originally Posted by Piotr Tarsa
Maybe do FFT on statistics as an enhancement of Context Mixing? :P
I don't know how that would work, since I haven't done much with CM, or whether you're even being serious. But one idea would be to do a transform before CM, like the BWT. Another idea would be to somehow use the transform to encode the ranges instead of arithmetic coding.

Since the DFT is bijective (if done carefully, at least), it can be used to represent anything that can be represented with bits. It just transforms data into a different domain. Numbers are usually represented as coefficients of a polynomial that is evaluated at the number base to give the value, and the inverse DFT effectively evaluates a polynomial at a set of complex points to represent it, so I think it's fascinating to think of the DFT as a kind of number system. Unlike binary or decimal, it doesn't have a big end or a small end; it spreads all of the data equally across the entire number. Now, transforming a number into a different domain doesn't compress it; it generally uses the same number of bits. So the transform inherently contributes nothing to compression. But the alternative domain might make something easier, or make something possible that wasn't in the original domain.

19. What I thought is that if, for example, while doing CM, we take a vector describing the probability of the current bit at consecutive orders, then the values in that vector should be somewhat correlated, so the FFT should fit them better than it fits consecutive symbols in a typical, decently compressible file.
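The bijectivity claim can be checked with a tiny number-theoretic transform, the finite-field cousin of the DFT mentioned earlier in the thread. All the constants here are my own illustrative choices: 257 is the smallest prime above the byte range, 64 is a primitive 8th root of unity mod 257 (64^4 ≡ -1), and the direct O(N^2) evaluation is a sketch, not an optimized FFT.

```python
P = 257   # smallest prime above the byte range; byte values 0..255 fit
W = 64    # primitive 8th root of unity mod 257 (64**4 % 257 == 256)
N = 8     # transform length; must divide P - 1 = 256

def ntt(x, root):
    # Direct evaluation of X[k] = sum_n x[n] * root^(n*k)  (mod P)
    return [sum(v * pow(root, n * k, P) for n, v in enumerate(x)) % P
            for k in range(N)]

data = [104, 101, 108, 108, 111, 33, 0, 0]        # "hello!" zero-padded
spec = ntt(data, W)                               # forward transform
inv_w, inv_n = pow(W, P - 2, P), pow(N, P - 2, P) # inverses via Fermat
back = [v * inv_n % P for v in ntt(spec, inv_w)]  # inverse transform
assert back == data  # bijective: the "different domain" loses nothing
```

The round trip uses the same number of field elements in both domains, which is exactly the point made above: the transform by itself contributes nothing to compression, it only changes the domain.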

Remember that BWT, LZ, PPM, and CM are mostly insensitive to permutations of the input alphabet, i.e., the exact symbol codes don't matter much. From what I know about the FFT, permuting the input symbols could hugely affect its output. But if you look at the statistics (in a CM or PPM compressor) after a permutation, they are generally the same (of course they are permuted too, but the correlations are mostly the same if you compress one symbol at a time), so the FFT is quite different from the usual transformations.

20. Originally Posted by Piotr Tarsa
What I thought is that if, for example, while doing CM, we take a vector describing the probability of the current bit at consecutive orders, then the values in that vector should be somewhat correlated, so the FFT should fit them better than it fits consecutive symbols in a typical, decently compressible file.

Remember that BWT, LZ, PPM, and CM are mostly insensitive to permutations of the input alphabet, i.e., the exact symbol codes don't matter much. From what I know about the FFT, permuting the input symbols could hugely affect its output. But if you look at the statistics (in a CM or PPM compressor) after a permutation, they are generally the same (of course they are permuted too, but the correlations are mostly the same if you compress one symbol at a time), so the FFT is quite different from the usual transformations.
I think what you are talking about is something like this:

Use CM to predict a bit, output (predicted bit XOR actual bit), advance and repeat until EOF.

Then you'd be left with a 1 everywhere the prediction failed and a 0 everywhere else, therefore all of the predictable patterns would vanish.

Maybe what you're describing is somewhat more complicated? But I think that is more or less it.

There still remains the question of what the FFT is being used for. The transform I described would yield a sparse bit vector, and you could use the FFT to encode it using a process like I described in #10.

21. I wasn't clear enough.
vector describing the probability of current bit in consecutive orders
I meant here a vector composed of these elements:
- probability of 1 in order-0,
- probability of 1 in order-1,
- probability of 1 in order-2,
...
- probability of 1 in order-<max-available>.

Today's CM-based compressors use neural networks to combine those probabilities, but what I've suggested is to throw the FFT in somewhere, as those probabilities should usually be somewhat correlated -- at least more correlated than consecutive symbols in the files we're losslessly compressing.

22. Originally Posted by Piotr Tarsa
I wasn't clear enough.

I meant here a vector composed of these elements:
- probability of 1 in order-0,
- probability of 1 in order-1,
- probability of 1 in order-2,
...
- probability of 1 in order-<max-available>.

Today's CM-based compressors use neural networks to combine those probabilities, but what I've suggested is to throw the FFT in somewhere, as those probabilities should usually be somewhat correlated -- at least more correlated than consecutive symbols in the files we're losslessly compressing.
Okay, I didn't get what you meant before, but now I think I do.

So if M is the number of orders, assuming it's constant, then the entire file becomes an N x M matrix -- and you'd like to use the FFT on the M probabilities that feed into each bit.

The FFT doesn't really combine values; it transforms a k-value sequence into another k-value sequence. The FFT could maybe serve some purpose here, but I'm not sure I see where you're going. I'm interested, though.

Edit:

I don't pretend to understand modern CM, but I think this is the situation --

Each order k is defined by the trailing k bits of context. So the probability of 1 in order 3 is the proportion of 1s that followed the same 3-bit trailing context as the current position. The various orders are definitely correlated then: The matching 3-bit contexts are a subset of the matching 2-bit contexts, which are a subset of the matching 1-bit contexts, etc. The same situation arises in PPM, doesn't it? Except that PPM escapes down to the longest context that has a match, rather than mixing all of the contexts together.
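The subset relation between orders can be seen with a toy order-k estimator. This is entirely my own sketch (simple frequency counts with add-one smoothing; real CM models are far more elaborate): in the periodic example below, the trailing context "011" is always followed by '0', so the higher orders predict more sharply than order 0.

```python
def order_k_prob(bits: str, pos: int, k: int) -> float:
    """Estimate P(bits[pos] == '1') from earlier positions that share
    the same trailing k-bit context, with Laplace (add-one) smoothing.
    The order-k matches are a subset of the order-(k-1) matches."""
    ctx = bits[pos - k:pos]
    ones = total = 0
    for i in range(k, pos):
        if bits[i - k:i] == ctx:
            total += 1
            ones += bits[i] == "1"
    return (ones + 1) / (total + 2)

bits = "110110110110"   # period-3 data: "110" repeated
# probabilities of a 1 at position 11 under orders 0..3
probs = [order_k_prob(bits, 11, k) for k in range(4)]
```

Each list entry is computed over a nested subset of positions, which is the correlation between orders being discussed here.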

A problem with a transform like the FFT is that the order number doesn't necessarily have much of a meaning, other than to denote the different subsets. In some contexts, order 5 could be the most useful context, but somewhere else, orders 4, 5, and 6 could have identical probabilities and be redundant. The issue is that bits don't carry equal amounts of information. So order might not make a good axis to define a function over.

-----------------

Somewhat related: Does CM make use of Bayesian inference at all?

I've played with an algorithm called Re-Pair, which is a bit like LZ77 and dictionary methods in general, but rather than reference arbitrary stretches of text, it pre-factorizes the entire text into a separate dictionary and rewrites the text as a string of references to the set of phrases in the dictionary. Consequently, a dictionary reference is a single number, not a (distance, length) pair. The compression is so-so, but decompression is very fast.

Re-Pair uses a greedy strategy to produce phrases by combining pairs of existing phrases. I think a major problem with the greedy approach is that it makes a lot of suboptimal choices, like attaching spaces to the beginning of some words and the end of other words, more or less randomly. A better factorization would probably avoid attaching spaces to words, unless it's making a multi-word phrase. I think a scoring system using Bayes' theorem might help. This is not necessarily directly relevant to this conversation, but I think the problem of finding a good static factorization might be one worth solving.

23. I think autocorrelation would be the most accurate way to detect files with fixed-size records. Otherwise, I'm out of ideas for how to apply DSP techniques to lossless compression. I started looking into it out of an interest in hashing, and I think it's useful to look at hashing through the lens of DSP. But as far as lossless compression goes, I'm moving on to other ideas.

24. I answered my own question from two posts ago. I came across one of Matt's papers where he's applying Bayesian inference in pretty much the way I imagined: http://cs.fit.edu/~mmahoney/dissertation/lex1.html
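The greedy pair-combining strategy of Re-Pair described above can be sketched in a few lines. This is my own toy version under stated simplifications: it rescans the whole sequence on every round (quadratic), where the real algorithm uses priority structures to run in linear time.

```python
from collections import Counter

def repair(text):
    """Toy greedy Re-Pair: repeatedly replace the most frequent adjacent
    pair with a fresh symbol, recording each rule in a dictionary."""
    seq, rules, fresh = list(text), {}, 256
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break                             # no pair repeats: done
        pair = pairs.most_common(1)[0][0]
        out, i = [], 0
        while i < len(seq):                   # left-to-right rewrite
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(fresh)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        rules[fresh] = pair
        seq, fresh = out, fresh + 1
    return seq, rules

def expand(sym, rules):
    """Undo the rules recursively, recovering the original text."""
    if sym in rules:
        a, b = rules[sym]
        return expand(a, rules) + expand(b, rules)
    return sym

seq, rules = repair("abcabcabc")
assert "".join(expand(s, rules) for s in seq) == "abcabcabc"
```

Note how each rule merges exactly two existing phrases, so the output is a string of single-number references plus a phrase dictionary, as described above; the greedy choice of which pair to merge is where the suboptimal word/space splits come from.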

...at least I think so. I can't quite seem to follow it right now. I'm a little sleepy.

25. Actually that is an experiment I did around 2000 to see if language was structured so that it could be parsed into words from just an n-gram character model. This allows a language model implemented as a deep neural network to be trained efficiently one layer at a time, bottom up: phonetic, lexical, semantic, grammatical.

26. Originally Posted by Matt Mahoney
Actually that is an experiment I did around 2000 to see if language was structured so that it could be parsed into words from just an n-gram character model. This allows a language model implemented as a deep neural network to be trained efficiently one layer at a time, bottom up: phonetic, lexical, semantic, grammatical.
I've been thinking about parsing, and I have an idea for an optimal parse algorithm that attempts to find the boundaries that naturally emerge from the text, without being LZ-specific or anything-specific.

It finds the optimal complete parenthesization (/ binary parse tree) given a cost function. The algorithm for optimizing a complete parenthesization is in CLR. The cost function I'm thinking of assigns the cost of a set of parentheses to be 1/freq, where freq is the frequency of the enclosed substring across the whole document. The way a string is parenthesized in one place doesn't affect the parenthesization of an identical string somewhere else; they're independent. The total cost is the sum of the costs of all the n-1 sets of parentheses (/ internal nodes in the parse tree). It sort of appears as if you pay the cost of something more than once as it gets enclosed by more and more parentheses, but you don't, really. It's not a greedy algorithm; the CLR algorithm is dynamic programming, and the general matrix-chain version is O(n^3) (over O(n^2) subproblems). It might be possible to make it linear on text.
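The interval DP just described can be sketched directly. This is my own illustrative code, not the proposed algorithm itself: `parse_cost` is a hypothetical name, frequencies count overlapping occurrences, and the splits follow the CLR matrix-chain recurrence with no attempt at speed.

```python
from functools import lru_cache

def parse_cost(s):
    """Minimal total cost of a complete parenthesization of s, where
    each set of parentheses around a substring costs 1/freq(substring),
    with freq counted over the whole document."""
    def freq(sub):
        # overlapping occurrence count of sub across the whole of s
        return sum(s.startswith(sub, i) for i in range(len(s)))

    @lru_cache(maxsize=None)
    def best(i, j):                     # optimal cost for s[i:j]
        if j - i <= 1:
            return 0.0                  # single symbols cost nothing
        split = min(best(i, k) + best(k, j) for k in range(i + 1, j))
        return split + 1.0 / freq(s[i:j])

    return best(0, len(s))

cost = parse_cost("abab")   # pairing the two frequent "ab" halves wins
```

For a string of length n there are exactly n-1 internal nodes, so the recurrence charges each set of parentheses once, matching the cost model described above.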

Edit: Linear time might be too much to expect. What I should have said is: it might be possible to make it reasonably fast on text, and a faster approximation of the global optimum might be good enough.
