1. ## Trend-seeking adaptation of probability distribution?

Imagine we have a single sequence of symbols. Thinking about predicting the probability of the next symbol based on the previous occurrences naturally leads to a simple and elegant approach, probably(?) first documented in the 1991 Howard & Vitter paper ( ftp://ftp.cs.brown.edu/pub/techreports/91/cs91-45.pdf ), nicely described e.g. in this post by Fabian:

CDF[i] -= (CDF[i] - newCDF[i]) >> rate

where CDF[i] = sum_{j<i} Pr(j) is the cumulative distribution function, operated on directly in arithmetic coding and rANS.
newCDF is the CDF from a new segment of data, used (every segment) to update the CDF used by the entropy coder.
The finest granularity is 1-symbol segments - updating the probability after every symbol, as is done e.g. in LZNA.
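As a concrete illustration, here is a minimal C sketch of this symbol-wise CDF adaptation (the alphabet size, precision, and rate below are arbitrary choices for the example, not taken from any particular codec):

```c
#include <stdint.h>

#define NSYM   4      /* alphabet size (arbitrary for this sketch) */
#define PSCALE 4096   /* total probability mass (12-bit precision) */
#define RATE   5      /* adaptation rate, as a shift amount */

/* cdf[i] = scaled sum of Pr(j) over j < i; cdf[0] = 0, cdf[NSYM] = PSCALE */
typedef struct { uint16_t cdf[NSYM + 1]; } Model;

void model_init(Model *m) {
    for (int i = 0; i <= NSYM; i++)
        m->cdf[i] = (uint16_t)(i * (PSCALE / NSYM));   /* uniform start */
}

/* One-symbol segments: after seeing 'sym', move every CDF entry a
   fraction 2^-RATE of the way toward the one-hot CDF of that symbol
   (newCDF[i] = 0 for i <= sym, PSCALE for i > sym).  This is exactly
   CDF[i] -= (CDF[i] - newCDF[i]) >> rate, applied per symbol. */
void model_update(Model *m, int sym) {
    for (int i = 1; i < NSYM; i++) {
        int target = (i > sym) ? PSCALE : 0;
        m->cdf[i] = (uint16_t)(m->cdf[i] - (((int)m->cdf[i] - target) >> RATE));
    }
}
```

(A real coder would additionally clamp entries so no symbol's probability ever reaches 0; that is left out here.)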
Choosing the optimal rate is a nontrivial problem.
Here is my slide with examples for such symbol-wise adaptation:

The natural question is whether this is the best we can do?

This approach just averages over the past, so it has no chance to exploit a trend in the data, like the rising probability of some symbol - in which case it might be better to assume a larger probability than the average over the past.
Intuitively, we would like to "fit a first or second order polynomial" to the history of the probability distribution of a given symbol and extrapolate to the current situation.
However, fitting a polynomial to a series of spikes (occurrences) is a nontrivial problem ... but it can be done.

My first approach, originally for describing the density profile of a molecule with polynomial coefficients (page 9 of https://arxiv.org/pdf/1509.09211 ), was to first find points describing the number of occurrences before given positions:
(x, number of occurrences or sum of weights before x)
then fit a polynomial to these points and take its derivative - this derivative (still a polynomial) is an approximation of the density history, and allows for the extrapolation we need...

Later that year I tried to use it to design trend-seeking predictors for data compression ... and along the way found a simpler method with better behavior:
smooth the sample with e.g. a Gaussian kernel ( https://en.wikipedia.org/wiki/Kernel_density_estimation ) and fit e.g. a polynomial to it ... it turns out that for mean-square fitting we can take the zero-width limit of the kernel ("fit to spikes": Dirac deltas), which leads to the optimal parameters.
It is an extremely cheap and natural method for parametric density estimation, for example literally fitting a polynomial to a sample ... but surprisingly I couldn't find it in the literature (?), so recently I returned to it and submitted to arXiv yesterday: https://arxiv.org/pdf/1702.02144

However, while in theory it could be used for trend-seeking predictors in data compression ... it has a real problem: the fitted polynomial often extrapolates outside the [0,1] range for probability ...
I am looking for a practical way to repair this ... ?
One possibility is to enforce normalization of the probability distribution to 1 ... but it is mathematically complex and could still extrapolate probability to negative values.
A much better-looking possibility is to use what Matt calls "squash/stretch" functions (a bijection between [0,1] and (-infinity,infinity)), for example squash(x) = 1/(1 + exp(-x)), or the arctangent.
So instead of fitting polynomials, we should try to fit squash(polynomial) and then squash the extrapolation ... however, this currently would need costly numerical integration for the fitting ...

I wanted to start a discussion about improving the Howard-Vitter predictor.
Is it the only widely used adaptation?
Have you seen approaches that improve it, maybe adding some trend-seeking? Any ideas how to do it?

2. 1. "The natural question is whether this is the best we can do?"

A more general form of linear counter looks like this:
p' = p*(1-wr) + (mw*bit + (1-mw)*(1-bit))*wr;
wr in [0,1) is the update rate;
mw in [0,1) is the "noise level";

So you don't have to use only power-of-2 rates.
Vector update can be implemented the same way though.
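In plain C, the general counter update above reads as follows (a direct transcription of the formula; the initial value and parameters are up to the modeler):

```c
/* One update of a general linear bit counter.
   p  = current estimate of Pr(bit = 1), in (0,1)
   wr = update rate in [0,1); wr = 1/2^k recovers the bitshift version
   mw = "noise level": the counter moves toward mw for a 1 bit and
        toward 1-mw for a 0 bit, instead of exactly 1 and 0. */
double counter_update(double p, int bit, double wr, double mw) {
    double target = mw * bit + (1.0 - mw) * (1 - bit);
    return p * (1.0 - wr) + target * wr;
}
```

Note that with mw < 1 the counter converges to mw on a constant stream, not to 1, which keeps predictions bounded away from certainty.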

2. I think this may be relevant: https://encode.ru/threads/1522-Param...rate)-function

3. Instead of simple ad hoc adding of stretch/squash, I'd suggest to look at likelihoods (ie string probabilities):

3. Sure, the CDF and bitshifts I wrote are just an optimized way of performing what you have written: operating on the CDF removes the problem of final normalization, and the bitshift removes the need for multiplication.
However, your more expensive version with a general parameter (wr; analogously CDF[i] -= (CDF[i] - newCDF[i]) * wr ) allows more flexible control of the adaptation rate - which brings us to another question: how do we optimally choose the adaptation rate?

I have briefly looked at the links, but I see they are about mixing - I haven't gone that far yet; maybe let's focus here just on the best probability prediction for a single sequence ...

4. First link is exactly about polynomial approximation of entropy(rate) function for a linear counter.

5. Either way, it's obvious that the p(wr) function for a given bit sequence can be described as a polynomial.
For example, with p0=0.5 and bits 0110 it would be:
step 0: 0.5
step 1: 0.5 + 0.5*wr
step 2: 0.5 - 0.5*wr^2
step 3: 0.5 - 0.5*wr - 0.5*wr^2 + 0.5*wr^3
step 4: 0.5 + 1.0*wr^3 - 0.5*wr^4
etc.
Ok, so I have to admit that I don't understand - could you elaborate this example?

I wanted to fit a polynomial to local time dependence of the probability.
I know how to do it in predictive way: just looking at the past occurrences to exploit local trends - that a given symbol is e.g. rising in frequency ...
... but there is this huge problem of polynomial going out of the [0,1] range while extrapolation ...
It could be repaired by squashing first, but it makes it mathematically difficult to fit ...

6. > Ok, so I have to admit that I don't understand - could you elaborate this example?

wr is counter update rate.
Counter update is a linear function, so multiple counter updates are a polynomial in wr.
That's what you quoted.
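The quoted steps can be reproduced with a small C sketch that tracks the counter value as polynomial coefficients in wr (the signs match if p is read as the counter's probability of a 0 bit; that reading is my interpretation of the example):

```c
#include <string.h>

#define MAXDEG 16

/* c[k] is the coefficient of wr^k in the counter value p(wr). */
typedef struct { double c[MAXDEG + 1]; int deg; } Poly;

void poly_init(Poly *p, double p0) {
    memset(p->c, 0, sizeof p->c);
    p->c[0] = p0;
    p->deg  = 0;
}

/* One symbolic counter update:
     bit 0:  p' = p + (1-p)*wr  ->  c'[k] = c[k] - c[k-1], plus +1 at k=1
     bit 1:  p' = p*(1-wr)      ->  c'[k] = c[k] - c[k-1]                */
void poly_update(Poly *p, int bit) {
    if (p->deg < MAXDEG) p->deg++;
    for (int k = p->deg; k >= 1; k--)
        p->c[k] -= p->c[k - 1];
    if (bit == 0) p->c[1] += 1.0;
}
```

Feeding p0 = 0.5 and the bits 0, 1, 1, 0 yields 0.5 + 0.5*wr, then 0.5 - 0.5*wr^2, then 0.5 - 0.5*wr - 0.5*wr^2 + 0.5*wr^3, and finally 0.5 + 1.0*wr^3 - 0.5*wr^4, exactly as in the list above.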

Next step is computing a product of counter value polynomials to get a string probability,
which is equal to 2^-codelength.
This product would be also a polynomial in wr, obviously.

Thus we can process a data sample and get a polynomial describing its codelength
as a function of the update rate, so we can find its maximum (as it's a probability)
faster than by coding attempts with all rate values.

It's all good and everything works at this point - there are even sources and graphs.
But it only works for very limited amounts of data - up to 1000 bits at most
(not enough precision for the polynomial coefs after that).

My original idea was to reduce these polynomials in some way, during updates.
I thought that coefs on one side would have exponentially lower values, so that they could be dropped.
But it turned out that every bit of coef precision is important.
But there could be still a way to reduce them - I just did not spend much time on this.

Btw, there's also linear mixing that can be handled the same way.

> I wanted to fit a polynomial to local time dependence of the probability.

Uh, the probability doesn't depend on "time".
It depends on the bit sequence.
Sure, if we assumed that the bit sequence is "0000..000" or "1111..111",
then it would seem like it depends on time, but that's useless.

> I know how to do it in predictive way: just looking at the past occurrences
> to exploit local trends - that a given symbol is e.g. rising in frequency...

I'd suggest to switch to FSM counters instead, then.
You can start by taking the whole bit history as a counter state,
then merge pairs of states (choosing each time the pair which reduces
the compressed size the least, when merged)
until only required amount remains.
This would automatically "exploit local trends" etc - whatever
fits in the final state.
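A minimal sketch of the starting point described here - the whole (short) bit history as the counter state, with one probability per state; the merging step is left out, and the history length and per-state rate are arbitrary choices:

```c
#define HBITS   4                /* history length: state = last 4 bits */
#define NSTATES (1u << HBITS)

typedef struct {
    unsigned state;              /* current bit history */
    double p[NSTATES];           /* Pr(next bit = 1), one per state */
} FSMCounter;

void fsm_init(FSMCounter *f) {
    f->state = 0;
    for (unsigned s = 0; s < NSTATES; s++) f->p[s] = 0.5;
}

double fsm_predict(const FSMCounter *f) { return f->p[f->state]; }

void fsm_update(FSMCounter *f, int bit) {
    f->p[f->state] += (bit - f->p[f->state]) * 0.1;  /* per-state rate */
    f->state = ((f->state << 1) | (unsigned)bit) & (NSTATES - 1);
}
```

On a periodic input like 0110 0110 ... each history state sees a deterministic next bit, so the per-state probabilities converge toward 0 or 1 - the "local trend" is captured by the state itself.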

7. Is there any way that you could come up with a simpler problem that captures the essential elements of the bigger problem with fewer inessential details? I can't really follow you, but it seems like there could be a more general problem inside the big one, and perhaps if you could express it purely in terms of math (no compression), you could ask it on a math forum and perhaps get better results.

8. Maybe a single rate isn't even the right approach.

If we have a data source with static probabilities, then we can learn those probabilities over time as we gather more and more data, in a PPMd fashion teasing out higher-order statistics (if useful). Obviously this doesn't work when we get data sources with dynamic probabilities whose distribution changes over time, hence adaptation. However, often we have a mix of the two, e.g. an archive of many file types. For this it almost feels like we need a way of detecting that the distribution has had a sudden and wild shift, speeding up adaptation at those points, but without suffering the loss in precision caused by adapting overly fast (inflating noise due to random fluctuations from subsampling a larger static probability distribution).

I've no idea how to achieve this, alas! But some feedback mechanism where the adaptation rate depends on the average recent CDF deltas may work.
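One hypothetical shape such a feedback mechanism could take (a sketch of the idea, not a tested scheme): run a fast and a slow estimate in parallel and lean on the fast one when they disagree, which is exactly when a distribution shift is likely:

```c
#include <math.h>

/* Fast and slow estimates of Pr(bit = 1); their disagreement serves as
   a crude shift detector.  The rates 0.20 / 0.02 are arbitrary. */
typedef struct { double fast, slow; } DualCounter;

void dual_update(DualCounter *c, int bit) {
    c->fast += (bit - c->fast) * 0.20;
    c->slow += (bit - c->slow) * 0.02;
}

double dual_predict(const DualCounter *c) {
    double w = fabs(c->fast - c->slow);   /* in [0,1]: disagreement */
    /* blend: mostly the average when stable, mostly fast after a shift */
    return w * c->fast + (1.0 - w) * 0.5 * (c->fast + c->slow);
}
```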

Edit: this is analogous to a block-based system with static frequencies per block (zlib, zstd, etc) coupled to a block boundary detection system. I know some people have worked on fine tuning the block boundaries for better compression ratios.

9. > For this it almost feels like we need a way of detecting that the distribution has had a sudden and wild shift,

You're right, but it's not so easy.
As my old post shows (Parametric-approximation-of-entropy(rate)-function),
the entropy(rate) function is an order-n^2 polynomial (n = number of processed bits), and it's also very unstable -
all coefs have huge values, so change even one bit in the coef values and the results are wrong.

Still, it should be possible to maintain such a polynomial for a limited window (eg. 100 bits),
and use it to optimize the rate. Or just directly test re-encoding the window bits with various
rate values, which would be simpler in this case.

However, in practice, we can't afford that level of complexity, and we can't even do blockwise
optimizations, because there's a big number of counters in various contexts, and optimal rates
could be different for all of them.

Also, there's an issue with probability mapping. PM/SSE can usually significantly improve
counter predictions by sharing the information between counters in similar contexts.
But, for counters with a static update, SSE can also sometimes imitate an indirect model.
That is, with a static update, counter values correspond to specific context histories,
so SSE can learn eg. that after "0110" there's always 0 in many contexts.
But dynamic update rates kill this option.

In addition, there's a more practical method - we can simply use multiple counters with
different fixed update rates, and then mix their predictions.
In this case, the mixes with intermediate weight values seem to be good enough approximations
of intermediate rate values.
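A minimal sketch of this multiple-fixed-rates idea (the LMS-style weight adaptation here is my own filler; the rates and step size are arbitrary):

```c
/* Two linear counters with different fixed rates, combined with an
   adaptively weighted linear mix. */
typedef struct { double p1, p2, w; } Mix2;

double mix_predict(const Mix2 *m) {
    return m->w * m->p1 + (1.0 - m->w) * m->p2;
}

void mix_update(Mix2 *m, int bit) {
    /* shift the weight toward whichever counter predicted better */
    double err = bit - mix_predict(m);
    m->w += 0.02 * err * (m->p1 - m->p2);
    if (m->w < 0.0) m->w = 0.0;
    if (m->w > 1.0) m->w = 1.0;
    m->p1 += (bit - m->p1) * 0.20;   /* fast fixed rate */
    m->p2 += (bit - m->p2) * 0.02;   /* slow fixed rate */
}
```

The mix with an intermediate weight then behaves roughly like a counter with an intermediate rate, as described above.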

So, in conclusion, I did my share of experiments with dynamic rates, and it seems
that optimized static rates, mixes of different rates, and SSE compensate for
the lack of dynamic rates well enough.

Well, there's also the idea of delayed counters, where you store a small
number of history bits with the counter and update it with the least recent one.
In this case it becomes possible to compute a probability estimation by
updating a temp counter copy with other history bits using any variable rates.
But in practice, the multi-bit update function for a linear counter is a special
case of interpolated SSE.

Btw, one example of adaptive rate can be seen here:
https://github.com/Shelwien/ppmd_sh/...md/mod_see.inc
it's a counter for the PPM SEE model, and its update rate (and precision!)
change depending on the escape probability value (higher precision for
lower probability).

> this is analogous to a block-based system with static frequencies per
> block (zlib, zstd, etc) coupled to a block boundary detection system. I
> know some people have worked on fine tuning the block boundaries for better
> compression ratios.

Yes. Actually bzip2 postcoder is based on this.
Also, here's Shkarin's seg_file utility, which is based on similar method:
http://ctxmodel.net/files/PPMd/segfile_sh2.rar


10. I think the instability you are talking about is caused by the strong dependence when approaching the boundaries: probability 0 and 1.
It could be improved by going from [0,1] to (-infinity,infinity) with squash/stretch functions, for example by fitting a polynomial to stretched probabilities.

12. 1. My polynomials are precise representations of entropy(rate) function - counter update is linear and encoding is a multiplication of two polynomials.
But it basically has all data bits encoded in all coefs, so even minor approximation attempts end up causing divergences between rangecoder codelength
and one computed from this polynomial.

2. There's no probability, only the update rate. We can substitute wr -> 1/(1+2^wq),
and even differentiate it by wq (we're looking for a maximum anyway) - it would only lower the polynomial order by one.

3. For adaptive rate optimization it would work as it is - we won't need very large window there.
But it would be anyway both faster and better to simply mix two counters with different static update rates.

4. Counter is a model of context history. So if we want to improve it, we can simply build a more complex model,
with contexts, mixing, SSE etc. Finding a method of computing an optimal rate for a linear counter update
won't really solve anything, especially if the method is not very simple.

13. This just popped up on Hacker News and it looks relevant (link at bottom), so I thought I might as well share. The averaging function they demonstrate is
Code:
function ExpAvg(sample, avg, w) {
  return w*sample + (1-w)*avg
}
Some other stuff on that site might be useful, too.

http://sam-koblenski.blogspot.com/20...mers-edge.html

14. That's exactly how "linear counters" work, eg. in lzma and bsc.
Nice blog, but I don't immediately see anything revolutionary.

15. I'm trying to follow this thread, even though some of it might be over my head...

Originally Posted by Jarek
The natural question is whether this is the best we can do?

This approach just averages over the past, so it has no chance to exploit a trend in the data, like the rising probability of some symbol - in which case it might be better to assume a larger probability than the average over the past.
Intuitively, we would like to "fit a first or second order polynomial" to the history of the probability distribution of a given symbol and extrapolate to the current situation.
However, fitting a polynomial to a series of spikes (occurrences) is a nontrivial problem ... but it can be done.
I'm not clear on how you are extrapolating. Are you assuming that you are compressing in independent blocks and the symbol probability will evolve in roughly the same way from block to block?

Or are you expecting the symbol "bursts" to reoccur with similar shape and you are wanting to search the CDF history from previous blocks to find "analogies" to predict the next burst?

EDIT:
Or... are you postulating that the symbol density is predicted over short stretches by something like a parabola + noise, and, if you can figure out the "acceleration" parameter, you can extrapolate to the next density value? If that's the case, I'm not sure why you would postulate that about data from artificial sources. I wouldn't expect there to be differentiable functions in there.

16. nburns,
indeed I wanted to use a line or parabola to model the local trend and extrapolate it to the current position.

For example, if a given symbol has appeared, let's say, 200, 100, 50, 25, 12, 6, 3 positions ago:
- just averaging over the last 200 positions, we would get 7/200 = 3.5% probability,
- using the standard adaptation p = p*(1-rate) + occurrence*rate with rate 0.05 leads to ~13% probability; it is an average with exponentially decaying weights,
- in contrast, a trend-seeking predictor should see the growth and expect something like 30% probability here.

I know how to cheaply fit a line or parabola here ( https://arxiv.org/pdf/1702.02144 + exponentially decaying weight), but it would often go below 0 or above 1.
Generally, we need to be much more careful near the 0 and 1 values.
It should be solved by fitting squash(polynomial) instead of a plain polynomial, e.g. for squash(x) = 1/(1 + exp(-x)), for which a linear decrease of the polynomial for p~0 means an exponential decrease of p ... but it becomes tough computationally ...

17. I think it's likely that the pattern is 'bursty'. This paper came up when I googled.

https://www.cs.cornell.edu/home/kleinber/bhs.pdf

I read a little bit of the paper. It models a bursty process as essentially a Markov process randomly choosing among different rates.

That would seem accurate for BWT data, because each suffix tree node presumably has its own rate.
