# Thread: Modelling... what exactly?

1. Modelling... what exactly?

For statistical compression, it is most common to code the next symbol using
the probability distribution which minimizes the codelength of the already
known data.

More advanced implementations may actually take the next symbol into account
too, and minimize the codelength of the already known data plus the next
symbol.

But either way, it seems to be common sense to think that the model in
compression predicts the next symbol, i.e. computes the probability
distribution of the next symbol, which is then immediately used for coding.

Now, is that actually correct?

Why, actually we have a perfect example of arithmetic coding with
completely precise modelling and coding - enumeration for a memoryless
source with known symbol counts.

It can be implemented in a way which would look exactly like a standard
statistical coder with model + AC, but it would still assign codes of
precisely the same length to all data permutations, without any redundancy
at all.

The actual coding is the same, so the model is what makes it so perfect.

Now, what's so different about the model in this case?

Well, it outputs the numbers of occurrences of each symbol until the end of input,
which lets us enumerate the future strings and precisely subdivide the code
space among strings starting with each symbol.

In short, in the example where the model is known to be correct and
precise, that model actually predicts the numbers of occurrences of each
symbol until the end of input, instead of the chances of the next symbol
being a specific one.

So, why don't we try doing the same in a "universal" model?

Actually, knowing the number of remaining symbols in the data is quite
common (mainly because of performance considerations - checking for EOF is
much slower than storing the length and counting symbols).

Suppose that we have p=P(bit==0) and n bits left.
The probability of next n bits containing k zeroes is Binomial[n,k]*p^k*(1-p)^(n-k)
(its a sum of likelihoods of strings with k zeroes divided by sum of
likelihoods of all strings, which is 1).
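As a sanity check, this distribution can be evaluated directly (a Python sketch; the function name is mine):

```python
from math import comb

def p_k_zeroes(n, k, p):
    # probability that exactly k of the next n bits are zeroes,
    # each bit being 0 independently with probability p
    return comb(n, k) * p**k * (1 - p)**(n - k)

# the distribution sums to 1 over k=0..n, e.g. for n=100, p=0.1:
total = sum(p_k_zeroes(100, k, 0.1) for k in range(101))
# total ≈ 1.0; p_k_zeroes(100, 10, 0.1) ≈ 0.13
```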

We have to choose the best k value, so it has to be the most probable one.
Then let's differentiate it by k and find the root. It's a little
complicated, but in the end Mathematica shows it as
`Log[(1-p)/p] = HarmonicNumber[n-k] - HarmonicNumber[k]`, where `HarmonicNumber[n] = Sum[1/i,{i,1,n}]`.
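For reference, the differentiation step (a sketch, extending the factorials via the Gamma function, with ψ the digamma function):

```latex
\frac{d}{dk}\left[\ln\binom{n}{k}+k\ln p+(n-k)\ln(1-p)\right]
  = \psi(n-k+1)-\psi(k+1)+\ln\frac{p}{1-p} = 0
```

and since H_m = ψ(m+1) + γ, this is exactly Log[(1-p)/p] = HarmonicNumber[n-k] - HarmonicNumber[k].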

Now, let's compute the actual estimations: http://nishi.dreamhosters.com/u/target1.png
```
f[p0_,n0_] := k /. FindRoot[(HarmonicNumber[k]-HarmonicNumber[-k+n]+Log[(1-p)/p]) /. p->p0 /. n->n0, {k,p0*n0}]
Plot[f[p,100]/100-p, {p,10^-3,1-(10^-3)}]
```
```
f[10/100,100] = 9.59634  (* optimal k for n=100, p=10/100 *)
f[90/100,100] = 90.4037  (* optimal k for n=100, p=90/100 *)
```
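The same maximum-likelihood k can be cross-checked without FindRoot: the continuous log-likelihood is concave in k, so a simple ternary search finds its maximum (a Python sketch; lgamma stands in for the factorials, function names are mine):

```python
from math import lgamma, log

def loglik(k, n, p):
    # continuous binomial log-likelihood: log C(n,k) + k log p + (n-k) log(1-p)
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def ml_k(n, p):
    # ternary search for the maximizing k; loglik is concave in k
    lo, hi = 0.0, float(n)
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loglik(m1, n, p) < loglik(m2, n, p):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# ml_k(100, 0.1) ≈ 9.596 and ml_k(100, 0.9) ≈ 90.404,
# matching the FindRoot values above
```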

And as we can see, it's somewhat different from the probability.

Am I doing something wrong?

2. This is something I often thought about when looking at PAQ and possibilities to improve it. What kept me away from it was the reasoning behind that graph - it's very good to see an actual visual and mathematical representation for it instead of just skimming through it. I don't think you've done something wrong here.

What the graph shows is that knowing "there will be exactly 49 zeroes in the next 100 bits" is useful, but it won't really help/change prediction or compression much, because the bit distribution in the remaining codespace is only slightly skewed. But as you get very close to the extremes (near 0 or 100 zeroes), the skew gets stronger and it will actually help by sorting out predictions that aren't possible anymore because "you don't have enough 0 or 1 bits left".

I never really tried to implement something like this because I thought it would end up in another typical tradeoff - you'll have to store the symbol counts in some way, too, and the cases where you actually compress better will likely be edge cases. Also, the graphs for larger n will likely approach a completely straight line, so you'll have to count symbols for small blocks instead of the whole file to get better results. On the other hand, in programs like PAQ, symbol counts might have a much stronger effect for individual models, as they might have only a few of 10000 cases left that contain exactly 49 zeroes (PAQ might be a bad example as its models usually predict one bit instead of the next 100, but I think you get the idea).

3. The binomial distribution assumes sampling with replacement. Try a hypergeometric distribution instead.

4. Binomial[n,k] is Mathematica syntax for n!/k!/(n-k)!
and I used it to compute the probability of a bit string of length n with k zeroes, given p=P(bit==0).

Also the topic is the meaning of rangecoder's inputs and whether the probability of getting p*n zeroes
among the next n bits is the same as the probability p of the single next bit being zero.

5. Binomial[n,k]*p^k*(1-p)^(n-k)

is the binomial distribution, and it assumes that you draw k values out of n values but put them back, which is obviously not what you are doing.

6. Ah. But I'm not sampling Binomial[n,k] strings, I'm computing the probability of a single sampled string being one of these.

7. But you're sampling Binomial[n,k] bits out of a string of length n.

Suppose you have a string of 100 bits where exactly 10 bits are "0".

your model (as I understand it):
pick a bit, look at it, put it back again, do this 100 times
the probability of 10 bits being "0" is: Binomial[100,10] * 0.1^(10)*(0.9)^90=~0.13

hypergeometric model:
pick a bit, put it away, do this 100 times
the probability of 10 bits being "0" is: Binomial[10,10]*Binomial[90,90]/Binomial[100,100] = 1
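The two samplings can be compared directly with math.comb (a Python sketch; function and parameter names are mine):

```python
from math import comb

def binom_pmf(n, k, p):
    # with replacement: k zeroes among n drawn bits, each zero w.p. p
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hyper_pmf(N, K, n, k):
    # without replacement: k zeroes when drawing n bits from a pool
    # of N bits that contains exactly K zeroes
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# binom_pmf(100, 10, 0.1)       ≈ 0.13  (sampling with replacement)
# hyper_pmf(100, 10, 100, 10)   == 1.0  (drawing the whole 100-bit string)
# hyper_pmf(1000, 100, 100, 10)          (close to the binomial value for a large pool)
```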

Sure, the difference between the two models gets smaller as the total length increases. For example, if you sample these 100 bits out of 1000 bits, the hypergeometric model also gives around ~0.13.

8. http://en.wikipedia.org/wiki/Combination
There're 2^n strings total, and Binomial[n,k] of them have k zeroes.

Also, there're Binomial[n-1,k-1] strings which start with a zero,
and Binomial[n-1,k] strings which start with a one, so we can
implement a coder like this:
Code:
```
// exact enumerative coding of an n-bit string with k zeroes
range = Binomial[n,k];
low = 0; n0 = 0;
for( i=0; i<n; i++ ) {
  // strings with a 0 in this position come first in the enumeration
  if( bit[i]==1 ) low += range*(k-n0)/(n-i);
  range *= (bit[i]==0) ? (k-n0) : (n-i)-(k-n0);
  range /= (n-i);
  n0 += (bit[i]==0);
}
```
which would directly convert an (n,k)-string to its number, without any redundancy.
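For reference, here's a runnable equivalent of the coder above in exact integer arithmetic (a Python sketch; math.comb plays the role of Binomial, and the function name is mine):

```python
from math import comb

def rank(bits):
    # map an n-bit string with k zeroes to its index in 0..C(n,k)-1,
    # enumerating strings that start with 0 before those that start with 1
    n, k = len(bits), bits.count(0)
    low, n0 = 0, 0
    for i, b in enumerate(bits):
        zeros_left = k - n0
        if b == 1 and zeros_left > 0:
            # skip the C(n-i-1, zeros_left-1) strings with a 0 here
            low += comb(n - i - 1, zeros_left - 1)
        n0 += (b == 0)
    return low
```

Ranking all strings with the same (n,k) hits each index 0..C(n,k)-1 exactly once, so the code is redundancy-free.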

As you can see, this is basically a simplified rangecoder without renormalization.
And I'm wondering whether we're right to use a bit's probability estimation for coding,
while actually it has to be the number of remaining bits.

Example: suppose the bit freqs are stored before the string.
Then, at some point we would know that there're no more zeroes,
while an adaptive model would still predict zeroes with a high probability.
How do we determine a point where decreasing counts become more
important than an extra order0 model which can be mixed in?

9. It doesn't matter how you construct the probability, and I think I understand it well enough. Combinatorics are somewhat restricted when it comes to priors and insight into the "stochastic process" behind the combinatorial modelling.

I think you wonder why 0.13 != 0.10 or the other way round with the k-value.

10. Ok, there's an unknown bitstring of length n.
We know that a probability of each bit in that string being zero is p.
What's the probability of that string having k zeroes?

I'm talking about the fact, that, given a bit probability estimation p,
we can go further and compute the probability of encountering x=0..n zeroes
within remaining n bits of data, then feed k/n to rangecoder, where
k is the most likely number of zeroes.

11. Originally Posted by Shelwien:
> And I'm wondering whether we're right to use a bit's probability estimation for coding,
> while actually it has to be the number of remaining bits.

The probability estimation is done via the Bernoulli distribution, and successive Bernoulli trials form a binomial distribution, so this is basically the same thing.
Knowing the total number of bits makes no difference to the classical method (because you draw with replacement).

> Example: suppose the bit freqs are stored before the string.
> Then, at some point we would know that there're no more zeroes,
> while an adaptive model would still predict zeroes with a high probability.
> How do we determine a point where decreasing counts become more
> important than an extra order0 model which can be mixed in?

I consider knowing only the totals of 0 and 1 useless. There should be freqs for many contexts. You could then mix the semi-static model with the adaptive one.

Originally Posted by Shelwien:
> Ok, there's an unknown bitstring of length n.
> We know that a probability of each bit in that string being zero is p.
> What's the probability of that string having k zeroes?

It depends on how you look at that string - either binomial or hypergeometric.
Hypergeometric: the probability of a 100-bit string with p=0.1 having 10 zeroes is 1!
Binomial: the probability of a 100-bit string with p=0.1 having 10 zeroes is ~0.13.

> I'm talking about the fact, that, given a bit probability estimation p,
> we can go further and compute the probability of encountering x=0..n zeroes
> within remaining n bits of data, then feed k/n to rangecoder, where
> k is the most likely number of zeroes.

Yes, I understand, but you're doing nothing different than probability estimation with the Bernoulli distribution.

12. > Knowing the total number of bits makes no difference to the classical method

It does, because P(bit==0)=0.99 can't be correct when there're only 10 remaining bits.
The right probability can be 0.9 or 1, but not 0.99.

And it's possible to encode lengths for relatively small blocks without any noticeable redundancy,
so the difference is not purely theoretical.

> I consider knowing the only total of 0 and 1 useless.

It's not useless if there're still 100 bits to decode and p0=0.99, but the zero counter has already expired.

13. Quick question: where does p come from?
Is that a prior pass, with statistics, or a guesstimation using any kind of modeller?

Also, knowing n is trivial for an order-0 coder, but what about order-n ?
How to know n for each context ? 14. > Quick question : where does p comes from ?
> Is that a prior pass, with statistics, or a guesstimation using any kind of modeler ?

The usual bit probability estimation from a usual model.
I'm discussing whether we actually need a different kind of input
for arithmetic coder, which is only approximately similar to that probability.

So I'm building a secondary model for that "different kind of input", based
on the known bit probability estimation, but obviously it's not the only way
to make such a model, and hardly the most precise one.

> Also, knowing n is trivial for an order-0 coder, but what about order-n ?
> How to know n for each context ?

For example, we can collect statistics on distances between context occurrences
and extrapolate it to compute remaining occurrences (knowing the total file size).

Anyway, it could get fairly complicated... let's find out first whether
it makes any sense for order0 :)

15. Originally Posted by Shelwien:
> > Knowing the total number of bits makes no difference to the classical method
>
> It does, because P(bit==0)=0.99 can't be correct when there're only 10 remaining bits.
> The right probability can be 0.9 or 1, but not 0.99.

As I wrote, it makes no difference with your method. You must use the hypergeometric model instead.

> Its not useless if there're still 100 bits to decode and p0=0.99, but zero counter already expired.

Maybe for very small blocks.

16. > You must use the hypergeometric model instead.

As I said, I'm not sampling the strings from a pool, there's only one string,
so the distribution is binomial.
I'm talking about the probability of next bit being zero vs number of zeroes among remaining bits.

> Maybe for very small blocks.

As I said, any data can be split into small blocks - in fact many popular compressors like rar and zip
do just that.

17. Consider a string x of length n bits with k bits equal to 0. Then p(x) = p(n) p(k|n) p(x|k,n). If the bits of x are i.i.d. then p(x|k,n) is just the binomial model described above. But you still have to model n and k. So given an ideal range coder, I don't see where there would be any coding savings.

18. > But you still have to model n and k

It can be assumed that we know n (filesize can be stored separately),
and we can even make it reasonably small.
And as to modelling k - yes, that's exactly what I'm suggesting here.
I tried to compute the maximum-likelihood estimation for k, and k/n seems to
be slightly different from p. However it looks like a linear mix with 0.5, except on edges,
and such a mixing with a tuned weight would commonly improve the compression,
so it's a little hard to test my theory, as we won't know whether the compression
improvement comes from our misinterpretation of rc inputs or simply from the noise in the data.

19. The probability model is naturally for an infinite string; you could consider the joint set of all used input blocks (separate files over a lifetime) as an "infinite" string. That we throw away the statistics of the previously coded files has practical reasons. The result is some sort of block-step non-stationary modelling, from the PoV of the model. Like, every time you move to another place, you force yourself to forget your past. But this is an imposed behaviour, for practical reasons, not because it's a property of the model.

If you wouldn't reset the model, then the previous file would pay the (bit-)price that the next file then would not. And nothing gets lost.

Your combinatorial model works only on finite strings; if you'd use a very long finite string you'd see the two models converge.

Interestingly, for long strings the overhead of n & k becomes negligible, and so does the gain of the combinatorial approach. For short strings n & k become much larger (vs. the string length), and so does the gain. Looks like you're just re-distributing bits.

20. Originally Posted by Matt Mahoney:
> If the bits of x are i.i.d. then p(x|k,n) is just the binomial model described above.

The attempt of using the number of "0" and "1" bits to make better estimations makes it non-i.i.d.

You're right about the binomial model. But the binomial model is just the compound process of the Bernoulli model (which every CM uses). Both ML estimations for p are exactly the same.

Originally Posted by Shelwien:
> I tried to compute the maximum-likelihood estimation for k, and k/n seems to
> be slightly different from p.

I still don't get your problem, because I don't know which p you mean.
If f[p,k,n] is the binomial model from above then it sounds like you "hope" that f[0.1,10,100]=0.10?
It also could be that your estimator for k is biased, but you can check that.

21. @Ethatron:

> The probability model is naturally for an infinite string

Both modelling precision and coding precision are limited,
so it's far from natural.

> throwing away the statistics of the previously coded files has practical reasons

I'm not talking about solid compression here, but it's actually helpful.
Frequently developers don't consider the nature of contexts when initializing the statistics,
and then any actual stats appear better than initial ones.
For example, this applies to lzma's masked literal model.

> Your combinatorical model works only on finite strings, if you'd use
> a very long finite string you'd see the two models converge.

Yes, in theory. But with imprecise models/coders and nonstationary data
it's not so simple.

> Interestingly, for long strings the overhead of n & k becomes neglible,
> so does the combinatorical approach gain

It's actually the opposite, in fact.
Because of coding overhead, already at 1000 symbols explicit coding of
the data size becomes better than an EOF symbol in the alphabet (same with EOF flags).

Anyway, this is actually all about "probability extrapolation".
First there appeared a reason to apply sq(C*st(p)) transformation
to model predictions (to convert a prior probability estimation to posterior),
and now I found another potential candidate.

It may be hard to provide theoretical foundations for various parameter
values etc., when various overheads of practical implementations are involved,
but it's easy to see the improvements in practice.

@Sebastian:

> I still don't get your problem, because i don't know which p you mean.

In this thread, I'm arguing that the actual inputs for the rangecoder
are symbol counts, not the probability.
Thus it may be possible that a model which would predict optimal symbol counts
could do better than a (usual) model which predicts the probability distribution of
the next bit.
I'm only comparing p with k/n (if that's what you meant in the quote) to show
that with different models there would be different rangecoder inputs
(and thus compression would also be different).
Also we don't have to predict n, because it's easier to assume that it's known -
in practice there's no coding overhead from that.

In fact we could store k too - how to make use of that additional information,
in conjunction with usual context model, is also an interesting question.
Btw, one possible answer is to predict k using context model and mix it with side info.

> If f[p,k,n] is the binomial model from above then it sounds like you "hope" that f[0.1,10,100]=0.10?

f[p,n] is k, where k is the maximum-likelihood parameter of the binomial model, given p and n.

> It also could be that your estimator for k is biased but you can check that.

That's actually what I meant by "doing something wrong".
Mathematica defines the harmonic numbers for fractional n (actually for complex n).
So atm I'm not sure whether it's ok to use the fractional k's provided by f[p,n], or
maybe it's better to explicitly compare the likelihoods for various integer k's.
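Comparing likelihoods for integer k's is cheap enough to brute-force (a Python sketch; the function name is mine):

```python
from math import comb

def best_integer_k(n, p):
    # argmax over integer k of C(n,k) * p^k * (1-p)^(n-k)
    return max(range(n + 1), key=lambda k: comb(n, k) * p**k * (1 - p)**(n - k))

# best_integer_k(100, 0.1) → 10 and best_integer_k(100, 0.9) → 90:
# the integer mode stays at Floor[(n+1)p], while the fractional
# roots above are ~9.6 and ~90.4
```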

Anyway, it seemed like an interesting topic to discuss.

22. > > Interestingly, for long strings the overhead of n & k becomes negligible,
> > so does the combinatorial approach gain

> Its actually the opposite, in fact.

I think what I said might be a bit convoluted:
Sending size+zeroes requires a few bits (sn & sk). Those bits vs. the bits of a very long string are negligible: ilength/(olength+sn+sk) ~= ilength/olength. I meant the overhead of sending n & k, not any overhead in the coding utilizing n & k.

Anyway. It may be interesting to look at this kind of file and see what happens:

1111...11110000...0000

The probability model starts and ends with p=0.5; your combinatoric model would stop producing anything at all when it reaches the first 0 - the last of either of the two bits in general. And in between?

23. > I meant the overhead of sending n&k

It's necessary to encode these values anyway, explicitly or implicitly - there're no infinite streams in practice.
And then explicit coding is usually better, because the overhead of various kinds of unary coding is too high.

> The probability models starts and ends with p=0.5

An adaptive model won't.

> your combinatoric model would stop producing anything at all

Not sure what you mean here. The binomial model from the first post is based on a usual model (which provides p).
But if we'd explicitly store counts of zeroes and ones, then yeah, there'd be nothing to decode when any of the counts becomes zero.

> And in between?

Well, we'd have a probability estimation p=n0/(n0+n1) and update (bit ? n1-- : n0--).
When n0 and n1 are both far from zero, it can be used like any other model - mixed with other estimations etc.

24. Originally Posted by Ethatron:
> > > Interestingly, for long strings the overhead of n & k becomes negligible,
> > > so does the combinatorial approach gain
> >
> > It's actually the opposite, in fact.
>
> I think what I said might be a bit convoluted:
> Sending size+zeroes requires a few bits (sn & sk). Those bits vs. the bits of a very long string are negligible: ilength/(olength+sn+sk) ~= ilength/olength. I meant the overhead of sending n & k, not any overhead in the coding utilizing n & k.
>
> Anyway. It may be interesting to look at this kind of file and see what happens:
>
> 1111...11110000...0000
>
> The probability model starts and ends with p=0.5; your combinatoric model would stop producing anything at all when it reaches the first 0 - the last of either of the two bits in general. And in between?

Actually, if you look at the work of Howard and Vitter: when you do a simple adaptive arithmetic coder, OR if you store the number of ones and zeroes and then use the exact probability based on the exact number of ones and zeroes remaining, you get the same compression. Having all ones followed by all zeroes really makes no difference. The compression is the same regardless of the permutation of the ones and zeroes; it's a function of the actual number of ones and zeroes. And there are many ways to do the same thing. Actually, when you write a real bijective arithmetic compressor based on any of these methods, you end up computing something like, say, 100.625 bits. What this really means is that the compression is not really to the same number: some permutations will be 100 bits while others will be 101 bits, and you have to map the fraction of bits to 2 different values.

When I wrote ARB255, the simple arithmetic compressor, I actually compress a string that is less than the number of bytes. I suspect there may be cases, because of the endings I chose to use, where if you start compressing all permutations of some file you end up with 3 values, say N, N+1 and N+2. However I don't remember ever getting more than two values, N and N+1, for various permutations of a file. ARB255 was for simple bijective adaptive compression of IID files of up to 256 symbols. I plan to update the code one of these days and post it here.

25. Originally Posted by Shelwien:
> > The probability model starts and ends with p=0.5

> An adaptive model won't.

It will: 1/2,2/3,3/4,4/5,5/6,6/7,7/8,7/9,7/10,7/11,7/12,7/13,7/14 counts/total for 1s; p=0.5 at the beginning and the end. My example is intentionally symmetric.

Originally Posted by Shelwien:
> > your combinatoric model would stop producing anything at all

> Not sure what you mean here. The binomial model from the first post is based on a usual model (which provides p).
> But if we'd explicitly store counts of zeroes and ones, then yeah, there'd be nothing to decode when any of the counts becomes zero.
>
> > And in between?
>
> Well, we'd have a probability estimation p=n0/(n0+n1) and update (bit ? n1-- : n0--).
> When n0 and n1 are both far from zero, it can be used like any other model - mixed with other estimations etc.
7/14,6/13,5/12,4/11,3/10,2/9,1/8,0/6,0/5,0/4,0/3,0/2,0/1 counts/total for 1s then. The zero probability is not a problem as no 1s are coded anymore (no collapse-interval-to-0), and 0s are coded on the full interval [0,total,total] (low, high, total).
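This pricing can also be checked numerically: with exact decrementing counts, the total codelength comes out to exactly log2 Binomial[n,k] for every permutation (a Python sketch; names are mine):

```python
from math import comb, log2

def adaptive_cost(bits):
    # total codelength when p = remaining_zeroes / remaining_bits,
    # with counts decremented after each coded bit
    n0, n1 = bits.count(0), bits.count(1)
    total = 0.0
    for b in bits:
        p0 = n0 / (n0 + n1)
        total += -log2(p0 if b == 0 else 1 - p0)
        if b == 0:
            n0 -= 1
        else:
            n1 -= 1
    return total

# adaptive_cost([1]*7 + [0]*7) == log2(C(14,7)) ≈ 11.74 bits,
# the same for every permutation - the trailing 0s cost nothing
```

The product of the per-bit probabilities is n0!·n1!/n! for any ordering, which is 1/C(n,n0), hence the permutation-independent total.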

I would say the last "1" pays the price that the remaining "0"s don't. The more likely the event of last-symbol-of-its-kind becomes, the worse the probability of that symbol gets. But it's a correct estimation, nothing to complain about.
