# Thread: Reducing the header - optimal quantization/compression of probability distribution?

1. ## Reducing the header - optimal quantization/compression of probability distribution?

The header of Huffman coding can be relatively small - it contains just the lengths of the bit sequences corresponding to the successive symbols: approximations of the probabilities by powers of 1/2.
However, as the speed of accurate entropy coding of large alphabets has increased a few times in the last few months (~7x for FSE/tANS vs. range coding), there appears the problem of storing more accurate probability distributions - a tradeoff with compression ratio.
So I wanted to start a discussion about how to do it well and fast, to optimize the total size of the file: header + compressed data.

To begin, let us look at the cost of approximation - if we use an entropy coder optimal for the {q_s} distribution to encode the {p_s} distribution, we lose on average the Kullback-Leibler divergence in bits/symbol:
deltaH = D(P||Q) = \sum_s p_s lg(1/q_s) - \sum_s p_s lg(1/p_s) = \sum_s p_s lg(p_s / q_s) ~ 0.72 \sum_s (p_s - q_s)^2 / p_s
where the last formula comes from expanding the logarithm to second order around 1 (the constant is 1/(2 ln 2) ~ 0.72).
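As a quick numeric sanity check of this second-order approximation (a sketch with made-up distributions, not from the thread):

```python
import math

def kl_bits(p, q):
    # exact Kullback-Leibler divergence D(P||Q) in bits/symbol
    return sum(ps * math.log2(ps / qs) for ps, qs in zip(p, q))

def kl_quadratic(p, q):
    # second-order approximation: (1/(2 ln 2)) * sum (p_s - q_s)^2 / p_s
    return sum((ps - qs) ** 2 / ps for ps, qs in zip(p, q)) / (2 * math.log(2))

p = [0.50, 0.30, 0.15, 0.05]          # "true" distribution
q = [0.50, 0.25, 0.1875, 0.0625]      # a coarse quantization of p
print(kl_bits(p, q), kl_quadratic(p, q))   # both ~0.015 bits/symbol
```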

So generally, the smaller the probability, the more accurately it should be given. One approach: divide the set of possible probabilities into growing disjoint ranges, each represented by (approximately) its central point, such that each of them has "at most" or "on average" a similar impact on the sum above.
However, there is an issue with the last symbol in this approach - the normalization sum_s q_s = 1.

So I thought that maybe we should start with normalization(?): a length-1 interval, and encode the cumulant function - the positions of the dividers between successive symbols (m-1 of them for an alphabet of size m). There is a fast way to do it nearly optimally, which I was told about by James Dow Allen (6th page) - divide the interval into, say, m buckets, use the unary system to encode the number of elements in every bucket (cost: ~2 bits * m), then eventually encode succeeding bits of their expansions.
In our case the number of these bits should depend on the sizes of the intervals between them - for example, going layer by layer, we give a succeeding bit if there is a neighboring element at the given approximation level, up to the maximal resolution - such that the decoder knows when to expect a succeeding bit for a given element. Do you know some good algorithms for this optimal header size problem?

2. ## The Following User Says Thank You to Jarek For This Useful Post:

Cyan (23rd February 2014)

3. I would also see this problem as abstract, but it was motivated by the Brotli team - they switched from Huffman to ANS, getting a better compression ratio ... which turned out to be worse after adding the size of the more accurate (larger) header ...
The natural solution is to just increase the block sizes, as the relative header size asymptotically disappears. But sometimes this size can be really large, especially if we would like to store many distributions, as for an order-1 method ...

To make this problem independent of the message size, we can for example relate deltaH to the header size in bits/symbol.
I have made such graphs for the method from my previous post: first encode the number of dividers per bucket using unary coding, then layer by layer give succeeding bits where "necessary": where there is a neighboring node.
The question is what probability distribution we should use for such tests. For parametric ones it doesn't make sense, as we could just write the parameters.
So the tests below are for a "random distribution" with the following generation procedure: take (alphabet size - 1) random numbers from the uniform distribution on [0,1] and use them as the cumulant function: sort them, and the probabilities are the differences.
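In Python, this generation procedure could look as follows (a hypothetical helper mirroring the Mathematica experiment, not the original source):

```python
import random

def random_distribution(m, rng=random):
    """Random distribution over m symbols: m-1 uniform points on [0,1]
    sorted as the cumulant function; probabilities are the differences."""
    cum = [0.0] + sorted(rng.random() for _ in range(m - 1)) + [1.0]
    return [cum[i + 1] - cum[i] for i in range(m)]

p = random_distribution(256)
```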

I have checked alphabet sizes 64, 256 and 1024 - surprisingly, there is large universality among them: Huffman always gives deltaH \in (0.025, 0.03) bits/symbol. It would need ~3-4 bits/symbol to store its approximated distribution.
The basic method mentioned above required about 5 bits/symbol and gave deltaH ~ 0.01. Then I increased the distance defining a neighbor by 2 to 10 times, getting the plots below (very similar for all alphabet sizes) - so we need about 6.5 bits/symbol for deltaH ~ 0.001, and 8 bits/symbol gives nearly perfect accuracy.
Here are the graphs - about 10 for each alphabet size (and the Mathematica source used): Any ideas for better (fast!) algorithms?

-----
Update - repaired broken dropbox links - Mathematica source: https://www.dropbox.com/s/di7kobrt1qb3nuj/header1.nb
Image from previous post: https://www.dropbox.com/s/ylkmjnzehr9qnln/header.png

4. ## The Following User Says Thank You to Jarek For This Useful Post:

Cyan (24th February 2014)

5. It's a very interesting topic, Jarek.
The table header typically represents a small portion of the total size, and as a consequence doesn't get that much attention for optimization.
So I feel there's ample room for progress.

If there is a way to ensure a header size of at most 7 bits/symbol with a controllable and guaranteed deltaH, then I feel it's a good deal.

I'm not sure, however, that I fully understand your proposed algorithm.
I feel the high-level principles are okay;
it's the exact implementation for a simple example which is missing.

For example, let's imagine we have a simple distribution of 4 symbols over 1024, as follows:
667 / 311 / 45 / 1

How would you deal with it?

Regards

6. This has been done many times. For one thing, it is a bijective problem: if the compression is not bijective, then you don't have an optimal header compression!!!

7. Basically you are encoding a bit vector, right? For table size L and alphabet S, the bit vector would have L zeroes and count(S)-1 ones, with the ones used to indicate the boundaries between ranges.

If you make no assumptions about distribution, then the optimal encoding would be to rank all the possibilities and write out the rank as a single number between 0 and (count(bits) choose count(ones)) -- enumerative coding. The size of the number would probably make this impractical.
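For reference, the enumerative ranking described here can be sketched with the textbook combinadic construction (not code from the thread):

```python
from math import comb

def rank(bits):
    """Lexicographic rank of a bit vector among all vectors of the same
    length with the same number of ones (enumerative coding)."""
    r, k = 0, bits.count(1)
    for i, b in enumerate(bits):
        if b:
            r += comb(len(bits) - i - 1, k)  # vectors with a 0 here come first
            k -= 1
    return r

def unrank(r, n, k):
    """Inverse of rank: rebuild the length-n, k-ones vector from its rank."""
    bits = []
    for i in range(n):
        c = comb(n - i - 1, k)
        if r >= c:
            bits.append(1); r -= c; k -= 1
        else:
            bits.append(0)
    return bits
```

The rank fits in ceil(lg C(n,k)) bits, which is exactly the enumerative bound - the practical problem is that it is one huge integer.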

Your method sounds like one in a paper by Saddakane that was meant as a succinct data structure for rank and select (I can find it if you're interested). I came up with a tree-based succinct rank data structure of my own and it turned out to use the same space as RLE+gamma to within 1%. It wasn't fast enough, though.

Have you considered using ABC to compress the bit vector? ABC (and ANS) are like approximate enumerative coding.

8. Originally Posted by biject.bwts: This has been done many times. For one thing, it is a bijective problem: if the compression is not bijective, then you don't have an optimal header compression!!!
In order for it to be bijective, the number of possibilities would have to equal 2^n for n bits. You'd have to change the problem, because it doesn't naturally fit a power of 2.

9. Hi Cyan, biject,

Indeed, this is an important and interesting basic problem, so it should have been studied before.
I have found, for example, "Quantization of Discrete Probability Distributions" by Yuriy Reznik, but it says that we should use uniform quantization, while I am certain that low probabilities should be represented with better accuracy.

The algorithm I have tested (from the picture) is:
for an alphabet of size m, we would like to encode m-1 dividers: p[s] = cum[s+1] - cum[s] (cum[0] = 0, cum[m] = 1).
First we divide the [0,1] range into m identical subranges, and for each of them encode the number of dividers inside using unary coding (e.g. 0 dividers as "0", 3 dividers as "1110") - this costs almost 2 bits/symbol.
Then we need to improve accuracy where there are multiple dividers nearby - defining maxdif = 1/m, if there is some divider within at most this distance from a given one, read one more bit to get a more accurate position for this divider.
And so on for succeeding layers - each time with a twice smaller maxdif.
That was the original algorithm; then I increased the initial maxdif a few times - getting better accuracy at the cost of more bits/symbol (graphs).
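A rough sketch of this encoder (my reading of the description above - the exact neighbor rule is an assumption, and the real experiments were done in Mathematica):

```python
import random

B = 16  # maximal resolution: divider positions as B-bit fixed point

def encode_header(pos, m, levels):
    """pos: sorted divider positions as B-bit integers; m: bucket count
    (assumed a power of two). Returns the emitted header bits."""
    out = []
    mbits = m.bit_length() - 1
    # 1) unary-code the number of dividers per bucket (~2 bits/symbol)
    counts = [0] * m
    for x in pos:
        counts[x >> (B - mbits)] += 1
    for c in counts:
        out += [1] * c + [0]
    # 2) layer by layer, send one more position bit when a neighbor is close;
    #    the decoder can repeat the same test on the prefixes it already has
    known = [mbits] * len(pos)  # bits of each position known so far
    for level in range(levels):
        maxdif = 1 << (B - mbits - level)
        approx = [(x >> (B - k)) << (B - k) for x, k in zip(pos, known)]
        for j in range(len(pos)):
            near = (j > 0 and approx[j] - approx[j - 1] <= maxdif) or \
                   (j + 1 < len(pos) and approx[j + 1] - approx[j] <= maxdif)
            if near and known[j] < B:
                out.append((pos[j] >> (B - known[j] - 1)) & 1)
                known[j] += 1
    return out
```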

I think a smaller header size can be obtained by a simpler algorithm: take a growing family of disjoint ranges of probabilities, each represented by (approximately) its central point - such that each such range has a similar impact on deltaH ~ sum_s (p_s - q_s)^2 / p_s.
Now for every symbol choose its probability as the central point of the range it falls into - as the ranges should have the same impact, this requires lg(number of ranges) bits/symbol.
However, such a method has an issue with normalization - the quantized probabilities usually will not sum to 1 (unless e.g. they are powers of 1/2).
Normalizing them will spoil accuracy a bit, and is costly (... and dirty).
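The normalization problem is already visible in the Huffman-like case, where each probability is replaced by the nearest power of 1/2 (illustrative numbers, not from the thread):

```python
import math

p = [0.6, 0.3, 0.07, 0.03]
# nearest power of 1/2 (nearest in the exponent)
q = [2.0 ** round(math.log2(x)) for x in p]
print(q, sum(q))   # [0.5, 0.25, 0.0625, 0.03125] -> sums to 0.84375, not 1
```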

Regards,
Jarek

nburns, indeed plain entropy coding seems a bit better than this unary coding method (16*h(1/16) ~ 5.4 bits/symbol for L = 2^12 and a 256-size alphabet) - asymptotically it needs 2 - lg(e) ~ 0.56 bits/symbol less.
However, on one hand it still seems more expensive (ok, that is pre-ANS thinking), and there is also a better similar approximation, needing only lg(3) - lg(e) ~ 0.14 bits/symbol more than the optimum (James Dow Allen):
instead of unary coding, encode "0" - 'zero elements', "1" - '1 element', "2" - '2 or more elements' (1 trit = lg(3) bits per bucket/symbol) - then encode the suffixes (further bits) in s_1 < ... < s_{k-1} > s_k order for the "2 or more" case - the decrease denotes the last element.

But the reason I haven't used it is that accuracy here should be larger for small probabilities - we need to adaptively modify the accuracy of a divider's position depending on the distance to its neighbors - how to do it fast and effectively?

10. I modified fpaqc to use a fixed prediction of Pr(1) = n_ones/n_bits, and it gave me 184 bytes for a 256-symbol alphabet and L = 4096 (ranges were random). (I tried using a dynamic prediction, but I couldn't get it to work.) That's 5.75 bits/symbol.

11. I don't have a specific algorithm, but there are some observations of the extremes which are interesting to consider in a byte-based ANS style system (i.e. alphabet = 256). It seems that no one solution will work best for all scenarios.

1) 256 symbols. Every character is represented. In this case we need 256 frequencies encoded. If we store frequencies in alphabet order then we don't need to encode what alphabet symbol we're encoding. If we store by frequency order (eg most frequent first) then we can shrink that data, but we also now need to store an additional 7-8 bits per value to indicate the symbol.

2) 2 symbols. It's wasteful to encode 256 frequencies, so we're almost certainly better off specifying which 2 symbols are present and what their frequencies are.

3) 128 symbols - the middle ground. Do we want to store all the symbols including the zeros, or just the 128 present and indicate which they are? Are the 128 adjacent to each other or spread out? (Personally, I store symbols in alphabet order and use an RLE-style system to indicate which are present, so ASCII text may have 9, 10, 32-126, for example.)

4) 10 symbols, spread apart. OK so we're going to have to store 10 symbols + 10 frequencies. It doesn't look like we can optimise out indicating what the symbols are too well. Maybe in this case it's best to sort by frequency and use that to shrink the frequency space.

For example, if we know our normalised sum is 65536 (due to the nature of a specific algorithm) and we're storing the most probable symbol first, then we have any value from 1 to 65527 (each of the 10 frequencies must be at least 1). Let's say it was 40000. We then know we have 9 remaining symbols, with a possible range from 1 to 25528. Does that help? Maybe.

Quite clearly we have two distinct problems here. How to indicate what symbols we have and how to specify what their frequencies are. We can optimise each independently (clearly sorting by alphabet makes the symbol encoding smaller, while sorting by frequency makes the frequency encoding smaller), but it's not so clear how we optimise this.

Maybe we need a public test set of frequencies (e.g. a normalized set from a chunk of enwik9) to play with. Without that, I have no idea what "184 bytes for a 256-symbol alphabet" really means. Is it good? Bad? It depends on the data! Attached is the first set of frequencies (normalised to 4096) to come out of FSE when run on enwik9. FSE uses 141 bytes (I use 147).

PS. More complex still is how to store order-1 frequencies. Clearly there will be some correlation effects which ideally we'd like to exploit.

12. ## The Following User Says Thank You to JamesB For This Useful Post:

Jarek (24th February 2014)

13. Originally Posted by nburns: In order for it to be bijective, the number of possibilities would have to equal 2^n for n bits. You'd have to change the problem, because it doesn't naturally fit a power of 2.
That's not true at all!! The fact is, when the header ends, the data begins - if there is a need for extra data at all. The thing is, there is nothing about being an optimal bijective compression that would force this constraint that you are adding.

14. Originally Posted by biject.bwts: That's not true at all!! The fact is, when the header ends, the data begins - if there is a need for extra data at all. The thing is, there is nothing about being an optimal bijective compression that would force this constraint that you are adding.
I suppose you are right. I was making unnecessarily simple assumptions.

15. Why don't we get someone we trust to test things like this? Create a random set of many files, say for 3 symbols, created by a random generator that produces IID files for some random probabilities, say randomly between 100 and 200 characters in length. Short files are best, since we are concerned about the effect of the so-called header length plus data, and the data part is trivial since it's well defined. Then many of us could create a program to compress and uncompress the files created for some of the runs, to see which method works best.
That way we can test several methods, and when this is done, move to more symbols or whatever.

16. James, I also thought about sorting frequencies first, but storing the order costs at least lg(256!)/256 ~ 6.58 bits/element, which is relatively large.
Regarding the order-1 method, it is basically taking frequencies of pairs (S_ij = p_ij / sum_k p_ik) - finding deeper relations is hard and strongly dependent on the data type...

biject.bwts, how does bijection help here??
If e.g. there are pn "1"s and you know this p, it means that you need to specify one of binomial(n, pn) ~ 2^{n h(p)} possibilities.
Surprisingly, we don't need to know this p in the 'reversed problem', but in standard entropy coding we do need this p to restrict to sequences of the given statistical properties - bijection will not magically help here, especially since entropy coders are already nearly bijective - especially enumerative coding.
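The counting identity used here is lg binomial(n, pn) ~ n*h(p), with h the binary entropy - a quick numeric check (values chosen arbitrarily):

```python
from math import comb, log2

def h(p):
    # binary entropy in bits
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, k = 1024, 128                 # p = k/n = 1/8
exact = log2(comb(n, k))
approx = n * h(k / n)
print(exact, approx)             # agree to within a few bits out of ~550
```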

Let's go back to the simplest, but probably the best, method (?) - dividing the space of probabilities into disjoint intervals, each represented by its center. Huffman can be seen this way - its centers are of the form 1/2^k.
James Dow Allen has just proposed to expand the Huffman set of centers to {1,3}/2^k (or {1,3,5,7}/2^k), which requires 1 (or 2) more bits than Huffman and still allows the encoder to ensure normalization.

Let us think how to choose this kind of method from the perspective of the deltaH ~ 0.72 sum_s (p_s - q_s)^2 / p_s approximation.
Such a sum of squares should be minimal if all impacts are equal - let's say (p_s - q_s)^2 / p_s <= C for all s.
Then deltaH <= 0.72 nC and E(deltaH) ~ 0.72 nC / 3, where n is the alphabet size.
Let the i-th range be (r_i, r_{i+1}), represented by its center (r_i + r_{i+1})/2.
For a given r_i, we would like to find r_{i+1} such that (p_s - q_s)^2 / p_s <= C is realized on the boundaries.
Solving a simple quadratic equation, we find that we should choose:
r_{i+1} = r_i + C + sqrt(C^2 + 4 C r_i)
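Iterating this recurrence is a one-liner; with r_1 = 1/2^12 and C chosen from E(deltaH) ~ 0.72*n*C/3 = 0.01 for n = 256 (my reading of the setup described here), it gives on the order of 79 ranges:

```python
from math import sqrt

def quantization_boundaries(C, r1=2.0 ** -12):
    # boundaries r_i with (approximately) equal impact C on deltaH
    r = [r1]
    while r[-1] < 1.0:
        r.append(r[-1] + C + sqrt(C * C + 4.0 * C * r[-1]))
    return r

C = 3 * 0.01 / (0.72 * 256)      # so that E(deltaH) ~ 0.72*n*C/3 = 0.01
r = quantization_boundaries(C)
print(len(r))                    # ~79 ranges -> ~6.3 bits/symbol if coded uniformly
```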

Choosing r_1 = 1/2^12, and C such that E(deltaH) ~ 0.01 ... it turned out that r_79 was the first boundary above 1; choosing one of these ranges would require lg(79) ~ 6.3 bits assuming a uniform distribution, which, even neglecting further complications like normalization, seems worse than the previous considerations.
However, it could be improved by using a nonuniform distribution among these intervals - there cannot be too many large-probability symbols, and generally there should be lots of smaller ones - maybe we could use some Zipf law with the parameter attached in the header ...
The normalized approach (encoding dividers) assumes a memoryless choice of dividers, which means that it already has a built-in exponential probability distribution over the probabilities. We should be able to improve a bit on e.g. 16 h(1/16) ~ 5.4 bits/symbol (for L = 2^12, alphabet size 256) by reducing accuracy for large probabilities and maybe modifying this exponential distribution of probabilities - the question is how to do it effectively ...

17. Originally Posted by Jarek: biject.bwts, how does bijection help here??
If e.g. there are p "1"s and you know this p, it means that you need to specify one of binomial(n,pn)~2^{nh(p)} possibilities.
Surprisingly, we don't need to know this p in 'reversed problem', but in standard entropy coding we rather need this p to restrict to sequences of given statistical properties - bijection will not magically help here, especially that entropy coders are already nearly bijective - especially enumerative coding.
Sadly, many enumerative codings miss the big picture and waste space by not grasping the added steps needed to make them bijective.

Let me see if I understand the problem you're asking about. Are you asking for a bijective compression of the set of all files that contain exactly p ones and the rest, if any, are zeros? Do you want to include the null file as one of the outputs, or only all non-null files? I don't think it's hard to make a bijection, if that's what you want. It's not what I would call the standard compression problem, though.

Here is a simple bijective transform from all strings (or files, since there is a simple bijection between them) to strings that contain exactly 20 ones.
You start transferring the file bit by bit until 20 ones are done; then the rest of the string is just a count of excess zeros.
Case one: you have no ones in the file to transform -
you put the zeros in the output, then add 20 ones.
Case two: you have ones, but not 20 ones -
you output as before, then add the rest of the ones to make it 20.
Case three: the string has exactly 20 ones, and its last bit is a 1 -
you transfer the input string to the output string unchanged.
Case four: exactly 20 or more ones, with data following the 20th one -
you transfer the input string to the output string until it has 20 ones in it,
then you use the unique bit pattern that follows as the number of trailing zeros:
"0" would be 1,
"1" would be 2,
"00" would be 3,
"01" would be 4, and so on.

The opposite transform is child's play. If you add other constraints, such that the file is usually long and the ones rare, you still transform to the space of all files as above, and then use a simple bijective 2-state compressor on the ones and zeros of the resulting file.

18. Originally Posted by Jarek: James, I also thought about sorting frequencies first, but storing the order costs at least lg(256!)/256 ~ 6.58 bits/element, which is relatively large.
Regarding the order-1 method, it is basically taking frequencies of pairs (S_ij = p_ij / sum_k p_ik) - finding deeper relations is hard and strongly dependent on the data type...
At the risk of running before we're walking, I realise now that an order-1 entropy encoder could check whether the order-1 frequencies show a strong divergence. Some we know will be very strong ("u" after "q" in English text), while in other contexts the previous character may give very little indication of the next. In those scenarios we could simply omit describing the dependent probabilities and just replace them with an "evenly distribute over the observed order-0 alphabet" indicator.

Of course the same applies to an order-0 uniform distribution, or indeed any remainder distribution. E.g. if we have symbols A, B and C from an alphabet of 26 that are heavily skewed, with D-Z all being pretty close to each other, then maybe we should just encode the latter with frequency 0 and have a rule that converts 0 to "an even split of the remainder". Or failing that, some other distribution (but that is harder, as it then requires defining the sort order, which again leads to a larger size).

In practice, I have no idea how big an impact that makes. Just how uniform are the masses? Probably not very, in most data. The first 1 MB of enwik shows (attached plot, sorted by symbol frequency): there's a straightish chunk in the middle (note this is log scale), but no real flat sections, so omitting to encode probabilities for the remainder symbols doesn't look like it really helps. About the only important thing is the large number of zeros, but we already have ways of dealing with those.

In other words, it's most likely barking up the wrong tree! Just my musings.

19. Originally Posted by Jarek: Hi Cyan, biject,

Indeed, this is an important and interesting basic problem, so it should have been studied before.
I have found, for example, "Quantization of Discrete Probability Distributions" by Yuriy Reznik, but it says that we should use uniform quantization, while I am certain that low probabilities should be represented with better accuracy.
Ha, I should have guessed that Yuriy had done something about it! Anyhow, in case you are not reading comp.compression, here is my take (besides that I'm still not sure I understood the problem to the degree I would need to provide an optimal solution):

There is some kind of universal law for distributions, at least for decorrelated data. Some abstract nonsense shows that scale-invariant distributions (as they are often obtained from decorrelated noisy sources, i.e. image residuals or prediction residuals in general) follow a "stable distribution", which is a generalized two-sided Laplacian law p(x) ~ exp(-\beta |x|^\alpha). For simplicity, one can assume alpha = 1 and make no big mistakes in general. Use that as a prior distribution, and only encode the residuals relative to that distribution - which would already give you an advantage.

Another good distribution to start with would be 1/n, where n is the symbol index. You also get something close to that by sorting probabilities in decreasing frequency for a lot of sources (1/n is of course something different from exp(-|x|), but the error is not so big for small alphabets, so which one you use probably does not matter).

But before that, I should probably read what Yuriy has to say about it. He's a pretty smart guy (I've known him for a couple of years).

20. Originally Posted by Jarek: However, it could be improved by using a nonuniform distribution among these intervals - there cannot be too many large-probability symbols, and generally there should be lots of smaller ones - maybe we could use some Zipf law with the parameter attached in the header ...
The normalized approach (encoding dividers) assumes a memoryless choice of dividers, which means that it already has a built-in exponential probability distribution over the probabilities. We should be able to improve a bit on e.g. 16 h(1/16) ~ 5.4 bits/symbol (for L = 2^12, alphabet size 256) by reducing accuracy for large probabilities and maybe modifying this exponential distribution of probabilities - the question is how to do it effectively ...
Currently, the header stores a number for each symbol in the table, and that number directly determines the frequency of that symbol in the table, right? So, you could write this as a formula, where Ft[s] means frequency of symbol s in the table and H[s] means the header entry for symbol s:

Ft[s] = H[s]

Maybe this should not be a linear function. What if you used, for some constant alpha:

Ft[s] = H[s]^alpha
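For alpha = 2 this is square-root companding: the header stores round(sqrt(p)*S) and the decoder squares it. A uniform quantization step on the sqrt scale then gives a roughly constant per-symbol impact (p-q)^2/p, which is what the deltaH ~ 0.72 sum_s (p_s-q_s)^2/p_s criterion asks for. A small check with an illustrative scale S (my numbers, not from the thread):

```python
S = 64  # header resolution: store round(sqrt(p) * S)

def roundtrip(p):
    # quantize on the sqrt scale, decode by squaring
    return (round(p ** 0.5 * S) / S) ** 2

probs = [0.001, 0.01, 0.1, 0.5, 0.9]
impacts = [(p - roundtrip(p)) ** 2 / p for p in probs]
# the impacts stay on the order of 1/S^2 across three decades of p
```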

A logical choice of alpha would be something > 1.

21. I don't see the point of even having a header when one is doing simple entropy compression of a file. I really think adaptive is the way to go, since it's one pass. It's also easier to make bijective, and clearly, when one looks at the set of files from an unknown IID source, the sum of the compressed files will be smaller than with any nonbijective method where you do more than one pass and/or write a header at the front of the file!!!

22. biject.bwts, enumerative coding is a bijection between combinations and e.g. the {1,...,binomial(n,k)} range ... but it doesn't help, as you still need to know k - the question is how to encode it optimally (especially for large distributions).

James, text is a very specific type of data. Only very few pairs have more than one appearance, so it is better to just list the possibilities, and they can be longer - leading to vocabulary (or LZ-like) methods. Optimally storing a vocabulary is a different problem, but there is also the issue of storing the probability distribution among the words - like here, but for a very large alphabet.

thorfdbg,
The problem is to store an approximated probability distribution, e.g. to minimize the description (~ 4-8 bits * alphabet size) for a given (average or maximal) Kullback-Leibler distance (< 0.01 bits/symbol) ... and it should be relatively fast.

About general probability distributions: if one is parametrized, we can just send the parameters ... however, if we would also need to send the order, it is relatively costly: lg(256!)/256 ~ 6.58 bits/element for a 256-size alphabet (O(N log N)).

23. Originally Posted by Jarek: biject.bwts, enumerative coding is a bijection between combinations and e.g. the {1,...,binomial(n,k)} range ... but it doesn't help, as you still need to know k - the question is how to encode it optimally (especially for large distributions).
I never thought much of so-called enumerative coding. I have seen code where one has, say, 7 A's and 6 B's, and then they write up to 3 numbers: one for the A's, one for the B's, then a single number for the combination. I think in general such methods suck, and yes, you are right, there is a bijection between the combination and a finite set of numbers - SO WHAT. But if you have a string of A's and B's, this enumerative approach does not lead to a simple bijection from the set of all files, so why do you keep talking about it?

Like I pointed out, adaptive may be the best way to optimally encode large distributions. Of course, there is more than one optimal way, depending on what other characteristics you give to the files that you are concerned with compressing.

24. Originally Posted by Jarek: About general probability distributions: if one is parametrized, we can just send the parameters ... however, if we would also need to send the order, it is relatively costly: lg(256!)/256 ~ 6.58 bits/element for a 256-size alphabet (O(N log N)).
If MTF is the last step before ANS, the symbols will already be nearly sorted by frequency (for BWT and similar kinds of inputs, at least). Empirically, BWT+MTF seems to give a surprisingly predictable distribution.

25. Jarek - what do you think of my suggestion to use a non-linear function to scale header values <-> frequencies? If alpha = 2, you'd basically store sqrt(p) in the header, and the decoder would square the header value to get back p. That would reallocate the precision the way you want, wouldn't it?

I think this is the exact analog of a problem that pops up in DSP in many contexts. In this case, you could say the bandwidth is too narrow to represent all the frequencies in the signal... the same problem applies to amplitudes, and you can compress and expand the amplitudes to get a better fit.

26. Originally Posted by Jarek: The problem is to store an approximated probability distribution, e.g. to minimize the description (~ 4-8 bits * alphabet size) for a given (average or maximal) Kullback-Leibler distance (< 0.01 bits/symbol) ... and it should be relatively fast.

About general probability distributions: if one is parametrized, we can just send the parameters ... however, if we would also need to send the order, it is relatively costly: lg(256!)/256 ~ 6.58 bits/element for a 256-size alphabet (O(N log N)).
If that is really the problem (namely, minimizing the K-L distance under a rate constraint) - no matter whether you consider this a good or a bad model - then the answer is relatively straightforward. It is a quantization problem: finding a suitable quantizer Q for the distribution. This type of question has been answered many times:

Define a functional J = D + \lambda R, where D is the K-L distance and R is the rate of the header; then minimize J for a given \lambda, and pick \lambda to match your operational point. Minimizing J requires some analysis - not that I have done it - but at least it provides you with the mathematical tools to discuss the problem. I would say that, starting from this equation, you will be able to analyze the problem and find the optimal solution.

Whether this solution is then practical to implement is another question, but for me actually a secondary problem. It is then a problem of finding a good approximation to the correct solution.

Actually, what Yuriy does is describe a quantizer that minimizes an l^\infty functional (a particularly ugly one), but that is not your problem. Thus the answer is a different one, and not the one you expect.

Bijectivity is not an issue here. It is not about finding the ideal compressor, but a different problem: finding the R/D-optimal description of a probability distribution.

27. Originally Posted by Jarek: About general probability distributions: if one is parametrized, we can just send the parameters ... however, if we would also need to send the order, it is relatively costly: lg(256!)/256 ~ 6.58 bits/element for a 256-size alphabet (O(N log N)).
This is a very practical question, and it can only be answered by "it depends". For most of the problems I work with, you would not need to send the order, because the distribution matches the model quite closely anyhow. There is no need to reorder. Yet, if you want to reach a certain precision, you would need to send residuals, and it would not be sufficient to send the parameters - there aren't any. Practically, you make the rule adaptive and do not send any parameters.

A good example of this is the Golomb coder of JPEG LS. It uses a parametrized distribution (a two-sided Laplacian, \rho^|x|), and it is easy to estimate \rho from the data you have. The model fits well, so there is no need to re-order.

Whether that is a problem you want to discuss, I do not know - i.e. do you want to consider "universal" coders that take any type of input, or rather domain-specific algorithms? In the latter case, you have much data from which to guess what the distribution should be, and thus better opportunities to optimize its transmission (if you actually have to).

28. Originally Posted by biject.bwts: I don't see the point of even having a header when one is doing simple entropy compression of a file. I really think adaptive is the way to go, since it's one pass. It's also easier to make bijective, and clearly, when one looks at the set of files from an unknown IID source, the sum of the compressed files will be smaller than with any nonbijective method where you do more than one pass and/or write a header at the front of the file!!!
One reason - SPEED. Show us a bijective entropy encoder running at 200-300 MB/sec and maybe I'll believe it's doable, but I doubt it. Even a non-bijective adaptive compressor will have trouble reaching that speed.

If you encode block-wise and use a header to store frequencies, then you can also use tricks to normalise the sum of those frequencies to a convenient power of 2, so that the encoder can replace various operations by bit-shifting. This isn't really feasible in an adaptive system.

29. Originally Posted by JamesB: If you encode block-wise and use a header to store frequencies, then you can also use tricks to normalise the sum of those frequencies to a convenient power of 2, so that the encoder can replace various operations by bit-shifting. This isn't really feasible in an adaptive system.
Adaptivity has no problems with speed. Rather the opposite: if you have to go over the data twice to collect information on the relative frequencies to build up a model for the data, you will be slow. Adaptive methods are one-pass. JPEG LS is certainly not slow, to give one example, and its adaptive Golomb coder does not require a header. The MQ coder is certainly not slow, but adaptive. Adaptive Huffman is slow, and computing a Huffman table from measured frequencies is also needlessly slow.

Despite that, any type of block coding has the drawback that it cannot decorrelate data across blocks, and it can only adapt to non-stationary sources at block boundaries. It is, for practical applications, a very limited model.

Anyhow, the original question is quite interesting once formalized correctly. 30. biject, I missed an "s" there: it should be "bijection between the set of combinations and the {1,...,binomial(n,k)} range" - the issue this thread is about is how to store such "k", especially in the large-alphabet case.
I don't know if "enumerative coding sucks", but some say it can be extremely fast ( http://www.1stworks.com/ref/tr/tr04-0815b.pdf ) - if we have a binary source with a stationary probability distribution.

nburns, if BWT+MTF gives a nearly ordered distribution from some parametrized family, it indeed can be represented by the parameters.
However, there are probably some essential distortions from perfect ordering there - and then there appears the problem of how to cheaply store such a near-identity permutation...?

About nonlinear scaling of values before using linear quantization to store them: for example, Huffman does something like this - it stores ~round(lg(1/p)).
I have tried a very brief general analysis of this kind of method in a previous post - probabilities from some range are written as the central point of that range.
If these ranges form a family (r_i, r_{i+1})_i, then to make all of them have the same impact on the deltaH ~ \sum_s (p_s-q_s)^2/p_s approximation, we should choose the r_i according to the recurrence:
r_{i+1} = r_i + C + sqrt(C^2 + 4 C r_i)
for some small constant C (the per-range impact on deltaH) - however, the number of such ranges turns out to be relatively large: ~79 for deltaH~0.01, i.e. lg(79)~6.3 bits/symbol.
Taking just some fixed nonlinear function instead would likely be even worse.
This can be improved by using the fact that larger probabilities are less frequent - so we could additionally use an entropy coder to store these quantized values.
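The recurrence above is easy to iterate directly. A small sketch (one can check by induction that the recurrence has the closed form r_n = n(n+1)*C; the particular constant C below is my own choice, picked to reproduce roughly the ~79 ranges mentioned):

```python
import math

def quant_divisors(C, top=1.0):
    """Range divisors 0 = r_0 < r_1 < ... < ~1 chosen so that replacing
    any p in (r_i, r_{i+1}) by the range's midpoint q contributes about
    the same amount C to the sum_s (p_s - q_s)^2 / p_s error term."""
    r = [0.0]
    while r[-1] < top:
        r.append(r[-1] + C + math.sqrt(C * C + 4 * C * r[-1]))
    return r

# Since r_n = n(n+1)*C, covering (0,1) takes about sqrt(1/C) ranges;
# C ~ 1/6320 gives roughly 79 ranges, i.e. about lg(79) ~ 6.3
# bits/symbol with a plain fixed-length code for the range index.
print(len(quant_divisors(1 / 6320)) - 1)
```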

thorfdbg, this is indeed the problem of finding an optimal quantizer (with a probability distribution among the quantized values), but the solution should also be practical - it cannot be too computationally costly.
Another issue is what type of probability distribution we should expect - can we assume e.g. Zipf's law?
Regarding Yuriy's article: he uses uniform quantization(?), while I am certain that lower probabilities should be given with better accuracy - do you disagree?

Regarding storing the order: indeed, in many cases, like the two-sided exponential distribution in image compression, we can neglect the difference from perfect order and just store the parameters of the distribution.
However, if we rely on some general law like Zipf's, or on methods like BWT+MTF mentioned by nburns, the deviation from perfect ordering can be essential (e.g. the single appearance of the EOF symbol in MTF...) and so should somehow be stored. 31. Originally Posted by Jarek thorfdbg, this is indeed the problem of finding an optimal quantizer (with a probability distribution among the quantized values), but the solution should also be practical - it cannot be too computationally costly.
Another issue is what type of probability distribution we should expect - can we assume e.g. Zipf's law?
Regarding Yuriy's article: he uses uniform quantization(?), while I am certain that lower probabilities should be given with better accuracy - do you disagree?

Regarding storing the order: indeed, in many cases, like the two-sided exponential distribution in image compression, we can neglect the difference from perfect order and just store the parameters of the distribution.
However, if we rely on some general law like Zipf's, or on methods like BWT+MTF mentioned by nburns, the deviation from perfect ordering can be essential (e.g. the single appearance of the EOF symbol in MTF...) and so should somehow be stored.
You seem to believe that a theoretical analysis hinders practical applicability. Sorry, but I disagree strongly. Before I can solve a problem by a hands-on approach, I need to understand it; otherwise I'm only fishing in the dark. To understand a problem, I use the language of maths. The result may or will require simplification and approximation, but at least it will allow me to construct something working. First make it work, then make it fast.

I don't disagree with you, nor with Yuriy. If you check his article, you'll see that he actually solves a different problem, so it is no miracle the result is a different one. The title of this work is a bit misleading.

Concerning storing or not storing the order of the elements: there are other choices, for example adapting to the order. A classical technique is an MTF over which you could collect the statistics, and use these as the basis for the encoding - probably not of the current block, but of the next block. So the technique will be partially adaptive. It really depends on your design goals. 32. Thomas, I have also graduated in theoretical mathematics - I understand your point, and gave you a simpler equivalent formulation: e.g. find the minimal description (rate) for a given maximal or expected KL distance.
However, many questions remain, like what type of probability distribution we are talking about - we can think of an optimal representation for a given family of probability distributions, but in practical data compression we rather need one which "generally behaves well".
I value purely mathematical abstract description, but studying physics has taught me to keep some distance from it - solving real-life problems is rather the art of finding good approximations, and formalization sometimes kills the essence of the problem; it is sometimes more productive to remain longer in the world of intuition. So I don't like working on a purely abstract information-theoretic description without imagining concrete constructions behind it.

I have returned to Yuriy's paper - it does look like he wants to solve our problem.
However, he first writes that the rate is R = lg(|Q|) bits, which means assuming a uniform probability distribution over all quantizers - while large probabilities should generally be less frequent.
Then he uses uniform quantization, while smaller probabilities should be represented with better accuracy to optimize the KL distance.
This is one of the problems with abstract theoretical models - they usually simplify aspects which turn out to be essential...

Ok, let's try to formalize it, though I don't see how it helps(?)
As in Yuriy's paper, we need the simplex of probabilities:
W = {(w_1,...,w_m) >= 0 : \sum_i w_i = 1}
for which we need to find a discrete subset of quantizers Q. This defines the quantization function: q(p) \in Q is the quantizer minimizing D(p||q) for a given probability distribution p. In 1D, values from an interval go to its center; generally we get a Voronoi diagram.
We also have a probability distribution among frequencies - a density function rho(p) on the simplex W. It induces a probability distribution on Q, whose entropy is our rate (not just lg(|Q|) as in Yuriy's paper): Pr(q) = \int_{p: q(p)=q} rho(p) dp, R = -\sum_q Pr(q) lg(Pr(q)).
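To make this rate concrete in the simplest case m = 2 (where W is just the interval [0,1], parametrized by p = w_1), here is a small numeric sketch with an assumed non-uniform density rho and uniform quantizer cells; all names and the choice of rho are mine, purely for illustration:

```python
import math

N = 16                               # number of quantizer cells, |Q|
rho = lambda p: 6 * p * (1 - p)      # assumed density on [0,1]; integrates to 1

def cell_mass(a, b, steps=2000):
    """Pr(q) = integral of rho over the cell mapped to q (midpoint rule)."""
    h = (b - a) / steps
    return sum(rho(a + (k + 0.5) * h) for k in range(steps)) * h

Pr = [cell_mass(i / N, (i + 1) / N) for i in range(N)]
R = -sum(m * math.log2(m) for m in Pr if m > 0)

# For a non-uniform rho the rate R comes out strictly below lg|Q| = 4 bits,
# which is exactly the gap between this definition and R = lg(|Q|).
print(round(R, 3), math.log2(N))
```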
Now the problem is e.g.: for a given maximal (max_p D(p||q(p))) or expected (\int_p rho(p) D(p||q(p)) dp) KL distance, find the quantization Q having the lowest rate. Or the opposite: for a given rate, find the quantizer with minimal KL distance. 33. Originally Posted by Jarek nburns, if BWT+MTF gives a nearly ordered distribution from some parametrized family, it indeed can be represented by the parameters.
However, there are probably some essential distortions from perfect ordering there - and then there appears the problem of how to cheaply store such a near-identity permutation...?
I was wondering if such a distribution could be usefully reduced to just parameters. There will clearly be error involved, from many sources. Getting the permutation right is not really the issue. The important thing is the total magnitude of the deviation in values between model and reality. In the BWT+MTF frequencies I've looked at, the orders match nicely until random noise becomes significant. The question is how much of the random noise you can leave out and still get a good fit? If you encode all of the noise, you won't save any space by transforming to a different representation.

About nonlinear scaling of values before using linear quantization to store them: for example, Huffman does something like this - it stores ~round(lg(1/p)).
I have tried a very brief general analysis of this kind of method in a previous post - probabilities from some range are written as the central point of that range.
If these ranges form a family (r_i, r_{i+1})_i, then to make all of them have the same impact on the deltaH ~ \sum_s (p_s-q_s)^2/p_s approximation, we should choose the r_i according to the recurrence:
r_{i+1} = r_i + C + sqrt(C^2 + 4 C r_i)
for some small constant C (the per-range impact on deltaH) - however, the number of such ranges turns out to be relatively large: ~79 for deltaH~0.01, i.e. lg(79)~6.3 bits/symbol.
Taking just some fixed nonlinear function instead would likely be even worse.
This can be improved by using the fact that larger probabilities are less frequent - so we could additionally use an entropy coder to store these quantized values.

thorfdbg, this is indeed the problem of finding an optimal quantizer (with a probability distribution among the quantized values), but the solution should also be practical - it cannot be too computationally costly.
Another issue is what type of probability distribution we should expect - can we assume e.g. Zipf's law?
Regarding Yuriy's article: he uses uniform quantization(?), while I am certain that lower probabilities should be given with better accuracy - do you disagree?

Regarding storing the order: indeed, in many cases, like the two-sided exponential distribution in image compression, we can neglect the difference from perfect order and just store the parameters of the distribution.
However, if we rely on some general law like Zipf's, or on methods like BWT+MTF mentioned by nburns, the deviation from perfect ordering can be essential (e.g. the single appearance of the EOF symbol in MTF...) and so should somehow be stored.
Zipf is the quintessential distribution behind text and man-made data generally. As far as I know, you only need to ask for the distribution if the data is coming from a sensor, such as audio and image data; otherwise you can assume an underlying Zipf/power-law distribution.
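A Zipf/power-law prior of the kind mentioned is easy to write down explicitly - a sketch (the function name and the default exponent a = 1 are arbitrary choices of mine):

```python
import math

def zipf(m, a=1.0):
    """Rank-s probability proportional to 1/s^a over an m-symbol alphabet,
    with ranks assumed already sorted by decreasing frequency."""
    w = [1.0 / s ** a for s in range(1, m + 1)]
    z = sum(w)  # normalization (the generalized harmonic number H_{m,a})
    return [x / z for x in w]

p = zipf(256)
print(round(-sum(x * math.log2(x) for x in p), 2))  # entropy in bits/symbol
```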

Good observation that Huffman transmits frequencies as lengths, which are logarithms. That's probably the right function to use. Huffman only suffers because the range of logarithms is too narrow.
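That observation can be demonstrated directly - a sketch deriving Huffman code lengths from frequencies (with a canonical Huffman code, only these lengths need to go into the header; the function is my own illustration, not from any particular codec):

```python
import heapq, itertools

def huffman_lengths(freqs):
    """Code length per symbol from a Huffman merge; for dyadic
    probabilities the length equals lg(1/p_s) exactly, and in general
    it is roughly round(lg(1/p_s))."""
    depth = [0] * len(freqs)
    tie = itertools.count()            # tie-breaker so tuples stay comparable
    heap = [(f, next(tie), [s]) for s, f in enumerate(freqs) if f > 0]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, sa = heapq.heappop(heap)
        fb, _, sb = heapq.heappop(heap)
        for s in sa + sb:              # every symbol under the merge gets deeper
            depth[s] += 1
        heapq.heappush(heap, (fa + fb, next(tie), sa + sb))
    return depth

print(huffman_lengths([8, 4, 2, 1, 1]))  # [1, 2, 3, 4, 4] = lg(1/p) for p = 1/2,1/4,1/8,1/16,1/16
```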
